t/porting/podcheck.t

   1 #!/usr/bin/perl -w
   2
   3 package main;
   4
   5 BEGIN {
   6     chdir 't';
   7     @INC = "../lib";
   8     # Do not require test.pl, this file has its own framework.
   9 }
  10
  11 use strict;
  12 use warnings;
  13 use feature 'unicode_strings';
  14
  15 use Carp;
  16 use Config;
  17 use Digest;
  18 use File::Find;
  19 use File::Spec;
  20 use Scalar::Util;
  21 use Text::Tabs;
  22
  23 BEGIN {
  24     if ( $Config{usecrosscompile} ) {
  25         print "1..0 # Not all files are available during cross-compilation\n";
  26         exit 0;
  27     }
  28     if ($^O eq 'dec_osf') {
  29         print "1..0 # $^O cannot handle this test\n";
  30         exit(0);
  31     }
  32     require '../regen/regen_lib.pl';
  33 }
  34
  35 sub DEBUG { 0 };
  36
  37 =pod
  38
  39 =head1 NAME
  40
  41 podcheck.t - Look for possible problems in the Perl pods
  42
  43 =head1 SYNOPSIS
  44
  45  cd t
  46  ./perl -I../lib porting/podcheck.t [--show_all] [--cpan] [--deltas]
  47                                     [--counts] [--pedantic] [FILE ...]
  48
  49  ./perl -I../lib porting/podcheck.t --add_link MODULE ...
  50
  51  ./perl -I../lib porting/podcheck.t --regen
  52
  53 =head1 DESCRIPTION
  54
  55 podcheck.t is an extension of Pod::Checker.  It looks for pod errors and
  56 potential errors in the files given as arguments, or if none specified, in all
  57 pods in the distribution workspace, except certain known special ones
  58 (specified below).  It does additional checking beyond that done by
  59 Pod::Checker, and keeps a database of known potential problems, and will
  60 fail a pod only if the number of such problems differs from that given in the
  61 database.
  62
  63 The additional checks it always makes are:
  64
  65 =over
  66
  67 =item Cross-pod link checking
  68
  69 Pod::Checker verifies that links to an internal target in a pod are not
  70 broken.  podcheck.t extends that (when called without FILE arguments) to
  71 external links.  It does this by gathering up all the possible targets in the
  72 workspace, and cross-checking them.  It also checks that a non-broken link
  73 points to just one target.  (The destination pod could have two targets with
  74 the same name.)
  75
  76 The way that the C<LE<lt>E<gt>> pod command works (for links outside the pod)
  77 is to actually create a link to C<search.cpan.org> with an embedded query for
  78 the desired pod or man page.  That means that links outside the distribution
  79 are valid.  podcheck.t doesn't verify the validity of such links, but instead
  80 keeps a database of those known to be valid.  This means that if a link to a
   81 target not on the list is created, the target needs to be added to the
   82 database.  This is accomplished via the L<--add_link|/--add_link MODULE ...>
  83 option to podcheck.t, described below.
  84
  85 =item An internal link that isn't so specified
  86
  87 If a link is broken, but there is an existing internal target of the same
  88 name, it is likely that the internal target was meant, and the C<"/"> is
  89 missing from the C<LE<lt>E<gt>> pod command.
  90
  91 =item Missing or duplicate NAME or missing NAME short description
  92
  93 A pod can't be linked to unless it has a unique name.
  94 And a NAME should have a dash and short description after it.
  95
  96 If the C<PERL_POD_PEDANTIC> environment variable is set or the C<--pedantic>
  97 command line argument is provided then a few more checks are made.
  98 The pedantic checks are:
  99
 100 =over
 101
 102 =item Verbatim paragraphs that wrap in an 80 (including 1 spare) column window
 103
 104 It's annoying to have lines wrap when displaying pod documentation in a
 105 terminal window.  This checks that all verbatim lines fit in a standard 80
 106 column window, even when using a pager that reserves a column for its own use.
 107 (Thus the check is for a net of 79 columns.)
 108 For those lines that don't fit, it tells you how much needs to be cut in
 109 order to fit.
 110
 111 Often, the easiest thing to do to gain space for these is to lower the indent
 112 to just one space.
 113
 114 =item Items that perhaps should be links
 115
 116 There are mentions of apparent files in the pods that perhaps should be links
  117 instead, using C<LE<lt>...E<gt>>.
 118
 119 =item Items that perhaps should be C<FE<lt>...E<gt>>
 120
 121 What look like path names enclosed in C<CE<lt>...E<gt>> should perhaps have
 122 C<FE<lt>...E<gt>> mark-up instead.
 123
 124 =back
 125
 126 A number of issues raised by podcheck.t and by the base Pod::Checker are not
 127 really problems, but merely potential problems, that is, false positives.
 128 After inspecting them and
 129 deciding that they aren't real problems, it is possible to shut up this program
 130 about them, unlike base Pod::Checker.  For a valid link to an outside module
 131 or man page, call podcheck.t with the C<--add_link> option to add it to the
  132 database of known links; for other causes, call podcheck.t with the C<--regen>
  133 option to regenerate the entire database.  This tells it that all existing
  134 issues are not to be mentioned again.
 135
 136 C<--regen> isn't fool-proof.  The database merely keeps track of the number of these
 137 potential problems of each type for each pod.  If a new problem of a given
 138 type is introduced into the pod, podcheck.t will spit out all of them.  You
  139 then have to figure out which is the new one, and whether it should be changed.
 140 But doing it this way insulates the database from having to keep track of line
 141 numbers of problems, which may change, or the exact wording of each problem
 142 which might also change without affecting whether it is a problem or not.
 143
 144 Also, if the count of potential problems of a given type for a pod decreases,
 145 the database must be regenerated so that it knows the new number.  The program
 146 gives instructions when this happens.
 147
 148 Some pods will have varying numbers of problems of a given type.  This can
 149 be handled by manually editing the database file (see L</FILES>), and setting
 150 the number of those problems for that pod to a negative number.  This will
 151 cause the corresponding error to always be suppressed no matter how many there
 152 actually are.
 153
 154 Another problem is that there is currently no check that modules listed as
 155 valid in the database
 156 actually are.  Thus any errors introduced there will remain there.
 157
 158 =back
 159
 160 =head2 Specially handled pods
 161
 162 =over
 163
 164 =item perltoc
 165
 166 This pod is generated by pasting bits from other pods.  Errors in those bits
 167 will show up as errors here, as well as for those other pods.  Therefore
  168 errors here are suppressed, and the pod is checked only to verify that the
  169 nodes within it that are linked to from elsewhere actually exist.
 170
 171 =item perldelta
 172
 173 The current perldelta pod is initialized from a template that contains
 174 placeholder text.  Some of this text is in the form of links that don't really
 175 exist.  Any such links that are listed in C<@perldelta_ignore_links> will not
 176 generate messages.  It is presumed that these links will be cleaned up when
 177 the perldelta is cleaned up for release since they should be marked with
 178 C<XXX>.
 179
 180 =item Porting/perldelta_template.pod
 181
 182 This is not a pod, but a template for C<perldelta>.  Any errors introduced
 183 here will show up when C<perldelta> is created from it.
 184
 185 =item cpan-upstream pods
 186
  187 See the L</--cpan> option documentation.
 188
 189 =item old perldeltas
 190
  191 See the L</--deltas> option documentation.
 192
 193 =back
 194
 195 =head1 OPTIONS
 196
 197 =over
 198
 199 =item --add_link MODULE ...
 200
 201 Use this option to teach podcheck.t that the C<MODULE>s or man pages actually
 202 exist, and to silence any messages that links to them are broken.
 203
 204 podcheck.t checks that links within the Perl core distribution are valid, but
 205 it doesn't check links to man pages or external modules.  When it finds
 206 a broken link, it checks its database of external modules and man pages,
 207 and only if not found there does it raise a message.  This option just adds
 208 the list of modules and man page references that follow it on the command line
 209 to that database.
 210
 211 For example,
 212
 213     cd t
 214     ./perl -I../lib porting/podcheck.t --add_link Unicode::Casing
 215
 216 causes the external module "Unicode::Casing" to be added to the database, so
 217 C<LE<lt>Unicode::CasingE<gt>> will be considered valid.
 218
 219 =item --regen
 220
 221 Regenerate the database used by podcheck.t to include all the existing
 222 potential problems.  Future runs of the program will not then flag any of
 223 these.  Setting this option also sets C<--pedantic>.
 224
 225 =item --cpan
 226
 227 Normally, all pods in the cpan directory are skipped, except to make sure that
 228 any blead-upstream links to such pods are valid.
 229 This option will cause cpan upstream pods to be fully checked.
 230
 231 =item --deltas
 232
 233 Normally, all old perldelta pods are skipped, except to make sure that
 234 any links to such pods are valid.  This is because they are considered
 235 stable, and perhaps trying to fix them will cause changes that will
 236 misrepresent Perl's history.  But, this option will cause them to be fully
 237 checked.
 238
 239 =item --show_all
 240
 241 Normally, if the number of potential problems of a given type found for a
 242 pod matches the expected value in the database, they will not be displayed.
 243 This option forces the database to be ignored during the run, so all potential
 244 problems are displayed and will fail their respective pod test.  Specifying
 245 any particular FILES to operate on automatically selects this option.
 246
 247 =item --counts
 248
 249 Instead of testing, this just dumps the counts of the occurrences of the
 250 various types of potential problems in the database.
 251
 252 =item --pedantic
 253
 254 There are three potential problems that are not checked for by default.
  255 This option enables them. The environment variable C<PERL_POD_PEDANTIC>
 256 can be set to 1 to enable this option also.
 257 This option is set when C<--regen> is used.
 258
 259 =back
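
As a sketch of the two ways of turning on the pedantic checks described above
(the second form assumes a POSIX-style shell that accepts a leading
C<VAR=value> assignment):

    cd t
    ./perl -I../lib porting/podcheck.t --pedantic
    PERL_POD_PEDANTIC=1 ./perl -I../lib porting/podcheck.t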
 260
 261 =head1 FILES
 262
  263 The database is stored in F<t/porting/known_pod_issues.dat>.
 264
 265 =head1 SEE ALSO
 266
 267 L<Pod::Checker>
 268
 269 =cut
 270
 271 # VMS builds have a '.com' appended to utility and script names, and it adds a
 272 # trailing dot for any other file name that doesn't have a dot in it.  The db
 273 # is stored without those things.  This regex allows for these special file
 274 # names to be dealt with.  It needs to be interpolated into a larger regex
 275 # that furnishes the closing boundary.
 276 my $vms_re = qr/ \. (?: com )? /x;
 277
 278 # Some filenames in the MANIFEST match $vms_re, and so must not be handled the
  279 # same way that the special vms ones are.  This hash lists those.
 280 my %special_vms_files;
 281
 282 # This is to get this to work across multiple file systems, including those
 283 # that are not case sensitive.  The db is stored in lower case, Un*x style,
 284 # and all file name comparisons are done that way.
 285 sub canonicalize($) {
 286     my $input = shift;
 287     my ($volume, $directories, $file)
 288                     = File::Spec->splitpath(File::Spec->canonpath($input));
 289     # Assumes $volume is constant for everything in this directory structure
 290     $directories = "" if ! $directories;
 291     $file = "" if ! $file;
 292     $file = lc join '/', File::Spec->splitdir($directories), $file;
 293     $file =~ s! / /+ !/!gx;       # Multiple slashes => single slash
 294
 295     # The db is stored without the special suffixes that are there in VMS, so
 296     # strip them off to get the comparable name.  But some files on all
 297     # platforms have these suffixes, so this shouldn't happen for them, as any
 298     # of their db entries will have the suffixes in them.  The hash has been
 299     # populated with these files.
 300     if ($^O eq 'VMS'
 301         && $file =~ / ( $vms_re ) $ /x
 302         && ! exists $special_vms_files{$file})
 303     {
 304         $file =~ s/ $1 $ //x;
 305     }
 306     return $file;
 307 }
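
# Illustrative sketch only, not part of the original file: what canonicalize()
# is expected to return for a sample input, assuming a Unix-like platform (so
# the VMS suffix stripping does not apply).  The input name is hypothetical.
#
#   canonicalize("Porting//Foo.POD")    # "porting/foo.pod"
#
# The name is lower-cased and the doubled slash is collapsed to a single one,
# which is the form in which names are stored in the database.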
 308
 309 #####################################################
 310 # HOW IT WORKS (in general)
 311 #
 312 # If not called with specific files to check, the directory structure is
 313 # examined for files that have pods in them.  Files that might not have to be
 314 # fully parsed (e.g. in cpan) are parsed enough at this time to find their
 315 # pod's NAME, and to get a checksum.
 316 #
 317 # Those kinds of files are sorted last, but otherwise the pods are parsed with
 318 # the package coded here, My::Pod::Checker, which is an extension to
 319 # Pod::Checker that adds some tests and suppresses others that aren't
 320 # appropriate.  The latter module has no provision for capturing diagnostics,
 321 # so a package, Tie_Array_to_FH, is used to force them to be placed into an
 322 # array instead of printed.
 323 #
 324 # Parsing the files builds up a list of links.  The files are gone through
 325 # again, doing cross-link checking and outputting all saved-up problems with
 326 # each pod.
 327 #
  328 # Sorting last the files that potentially don't need to be fully parsed allows
 329 # us to not parse them unless there is a link to an internal anchor in them
 330 # from something that we have already parsed.  Keeping checksums allows us to
 331 # not parse copies of other pods.
 332 #
 333 #####################################################
 334
  335 # 1 => Exclude low priority messages that aren't likely to be problems and
  336 # that have many false positives; higher numbers give more messages.
 337 my $Warnings_Level = 200;
 338
 339 # perldelta during construction may have place holder links.  N.B.  This
 340 # variable is referred to by name in release_managers_guide.pod
 341 our @perldelta_ignore_links = ( "XXX", "perl5YYYdelta", "perldiag/message" );
 342
 343 # To see if two pods with the same NAME are actually copies of the same pod,
 344 # which is not an error, it uses a checksum to save work.
 345 my $digest_type = "SHA-1";
 346
 347 my $original_dir = File::Spec->rel2abs(File::Spec->curdir);
 348 my $data_dir = File::Spec->catdir($original_dir, 'porting');
 349 my $known_issues = File::Spec->catfile($data_dir, 'known_pod_issues.dat');
 350 my $MANIFEST = File::Spec->catfile(File::Spec->updir($original_dir), 'MANIFEST');
 351 my $copy_fh;
 352
 353 my $MAX_LINE_LENGTH = 79;   # 79 columns
 354 my $INDENT = 7;             # default nroff indent
 355
 356 # Our warning messages.  Better not have [('"] in them, as those are used as
 357 # delimiters for variable parts of the messages by poderror.
 358 my $broken_link = "Apparent broken link";
 359 my $broken_internal_link = "Apparent internal link is missing its forward slash";
 360 my $multiple_targets = "There is more than one target";
 361 my $duplicate_name = "Pod NAME already used";
 362 my $no_name = "There is no NAME";
 363 my $missing_name_description = "The NAME should have a dash and short description after it";
 364 # the pedantic warnings messages
 365 my $line_length = "Verbatim line length including indents exceeds $MAX_LINE_LENGTH by";
 366 my $C_not_linked = "? Should you be using L<...> instead of";
 367 my $C_with_slash = "? Should you be using F<...> or maybe L<...> instead of";
 368
 369 # objects, tests, etc can't be pods, so don't look for them. Also skip
 370 # files output by the patch program.  Could also ignore most of .gitignore
 371 # files, but not all, so don't.
 372
 373 my $obj_ext = $Config{'obj_ext'}; $obj_ext =~ tr/.//d; # dot will be added back
 374 my $lib_ext = $Config{'lib_ext'}; $lib_ext =~ tr/.//d;
 375 my $lib_so  = $Config{'so'};      $lib_so  =~ tr/.//d;
 376 my $dl_ext  = $Config{'dlext'};   $dl_ext  =~ tr/.//d;
 377
 378 # Not really pods, but can look like them.
 379 my %excluded_files = (
 380                         canonicalize("lib/unicore/mktables") => 1,
 381                         canonicalize("Porting/make-rmg-checklist") => 1,
 382                         canonicalize("Porting/perldelta_template.pod") => 1,
 383                         canonicalize("regen/feature.pl") => 1,
 384                         canonicalize("regen/warnings.pl") => 1,
 385                         canonicalize("autodoc.pl") => 1,
 386                         canonicalize("configpm") => 1,
 387                         canonicalize("miniperl") => 1,
 388                         canonicalize("perl") => 1,
 389                         canonicalize('cpan/Pod-Perldoc/corpus/no-head.pod') => 1,
 390                         canonicalize('cpan/Pod-Perldoc/corpus/perlfunc.pod') => 1,
 391                         canonicalize('cpan/Pod-Perldoc/corpus/utf8.pod') => 1,
 392                         canonicalize("lib/unicore/mktables") => 1,
 393                     );
 394
 395 # This list should not include anything for which case sensitivity is
 396 # important, as it won't work on VMS, and won't show up until tested on VMS.
 397 # All or almost all such files should be listed in the MANIFEST, so that can
 398 # be examined for them, and each such file explicitly excluded, as is done for
 399 # .PL files in the loop just below this.  For files not catchable this way,
 400 # is_pod_file() can be used to exclude these at a finer grained level.
 401 my $non_pods = qr/ (?: \.
 402                        (?: [achot]  | zip | gz | bz2 | jar | tar | tgz
 403                            | orig | rej | patch   # Patch program output
 404                            | sw[op] | \#.*  # Editor droppings
 405                            | old      # buildtoc output
 406                            | xs       # pod should be in the .pm file
 407                            | al       # autosplit files
 408                            | bs       # bootstrap files
 409                            | (?i:sh)  # shell scripts, hints, templates
 410                            | lst      # assorted listing files
 411                            | bat      # Windows,Netware,OS2 batch files
 412                            | cmd      # Windows,Netware,OS2 command files
 413                            | lis      # VMS compiler listings
 414                            | map      # VMS linker maps
 415                            | opt      # VMS linker options files
 416                            | mms      # MM(K|S) description files
 417                            | ts       # timestamp files generated during build
 418                            | $obj_ext # object files
 419                            | exe      # $Config{'exe_ext'} might be empty string
 420                            | $lib_ext # object libraries
 421                            | $lib_so  # shared libraries
 422                            | $dl_ext  # dynamic libraries
 423                            | gif      # GIF images (example files from CGI.pm)
 424                            | eg       # examples from libnet
 425                            | core
 426                        )
 427                        $
 428                     ) | ~$ | \ \(Autosaved\)\.txt$ # Other editor droppings
 429                            | ^cxx\$demangler_db\.$ # VMS name mangler database
 430                            | ^typemap\.?$          # typemap files
 431                            | ^(?i:Makefile\.PL)$
 432                 /x;
 433
 434 # Matches something that looks like a file name, but is enclosed in C<...>
 435 my $C_path_re = qr{ ^
 436                         # exclude various things that have slashes
 437                         # in them but aren't paths
 438                         (?!
 439                             (?: (?: s | qr | m | tr | y ) / ) # regexes
 440                             | \d+/\d+ \b       # probable fractions
 441                             | (?: [LF] < )+
 442                             | OS/2 \b
 443                             | Perl/Tk \b
 444                             | origin/blead \b
 445                             | origin/maint \b
 446
 447                         )
 448                         /?  # Optional initial slash
 449                         \w+ # First component of path, doesn't begin with
 450                             # a minus
 451                         (?: / [-\w]+ )+ # Subsequent path components
 452                         (?: \. \w+ )?   # Optional trailing dot and suffix
 453                         >*  # Any enclosed L< F< have matching closing >
 454                         $
 455                     }x;
 456
 457 # '.PL' files should be excluded, as they aren't final pods, but often contain
 458 # material used in generating pods, and so can look like a pod.  We can't use
  459 # the regexp above because case sensitivity is important for these, as some
 460 # '.pl' files should be examined for pods.  Instead look through the MANIFEST
 461 # for .PL files and get their full path names, so we can exclude each such
 462 # file explicitly.  This works because other porting tests prohibit having two
 463 # files with the same names except for case.
 464 open my $manifest_fh, '<:bytes', $MANIFEST or die "Can't open $MANIFEST";
 465 while (<$manifest_fh>) {
 466
 467     # While we have MANIFEST open, on VMS platforms, look for files that match
 468     # the magic VMS file names that have to be handled specially.  Add these
 469     # to the list of them.
 470     if ($^O eq 'VMS' && / ^ ( [^\t]* $vms_re ) \t /x) {
 471         $special_vms_files{$1} = 1;
 472     }
 473     if (/ ^ ( [^\t]* \. PL ) \t /x) {
 474         $excluded_files{canonicalize($1)} = 1;
 475     }
 476 }
  477 close $manifest_fh or die "Can't close $MANIFEST";
 478
 479
 480 # Pod::Checker messages to suppress
 481 my @suppressed_messages = (
 482     # We catch independently the ones that are real problems.
 483     qr/multiple occurrences \(\d+\) of link target/,
 484
 485     "unescaped <>",                 # Not every '<' or '>' need be escaped
 486     qr/No items in =over/,          # i.e., a blockquote, which we consider legal
 487 );
 488
 489 sub suppressed {
  490     # Returns a boolean indicating whether the input message is to be suppressed
 491
 492     my $message = shift;
 493
 494     return grep { $message =~ /^$_/i } @suppressed_messages;
 495 }
 496
 497 {   # Closure to contain a simple subset of test.pl.  This is to get rid of the
 498     # unnecessary 'failed at' messages that would otherwise be output pointing
 499     # to a particular line in this file.
 500
 501     my $current_test = 0;
 502     my $planned;
 503
 504     sub plan {
 505         my %plan = @_;
 506         $planned = $plan{tests} + 1;    # +1 for final test that files haven't
 507                                         # been removed
 508         print "1..$planned\n";
 509         return;
 510     }
 511
 512     sub ok {
 513         my $success = shift;
 514         my $message = shift;
 515
 516         chomp $message;
 517
 518         $current_test++;
 519         print "not " unless $success;
 520         print "ok $current_test - $message\n";
 521         return $success;
 522     }
 523
 524     sub skip {
 525         my $why = shift;
 526         my $n    = @_ ? shift : 1;
 527         for (1..$n) {
 528             $current_test++;
 529             print "ok $current_test # skip $why\n";
 530         }
 531         no warnings 'exiting';
 532         last SKIP;
 533     }
 534
 535     sub _note {
  536         my ($handle, $message) = @_;
  537
  538         chomp $message;
  539
  540         print $handle $message =~ s/^/# /mgr;
  541         print $handle "\n";
 542         return;
 543     }
 544
 545     sub note { unshift @_, \*STDOUT; goto &_note }
 546
 547     sub diag { unshift @_, \*STDERR; goto &_note }
 548
 549     END {
 550         if ($planned && $planned != $current_test) {
 551             print STDERR
 552             "# Looks like you planned $planned tests but ran $current_test.\n";
 553         }
 554     }
 555 }
 556
 557 # List of known potential problems by pod and type.
 558 my %known_problems;
 559
 560 # Pods given by the keys contain an interior node that is referred to from
 561 # outside it.
 562 my %has_referred_to_node;
 563
 564 my $show_counts = 0;
 565 my $regen = 0;
 566 my $add_link = 0;
 567 my $show_all = 0;
 568 my $pedantic = 0;
 569
  570 my $do_upstream_cpan = 0; # Assume that we are to skip anything in /cpan
 571 my $do_deltas = 0;        # And stable perldeltas
 572
 573 while (@ARGV && substr($ARGV[0], 0, 1) eq '-') {
 574     my $arg = shift @ARGV;
 575
 576     $arg =~ s/^--/-/; # Treat '--' the same as a single '-'
 577     if ($arg eq '-regen') {
 578         $regen = 1;
 579         $pedantic = 1;
 580     }
 581     elsif ($arg eq '-add_link') {
 582         $add_link = 1;
 583     }
 584     elsif ($arg eq '-cpan') {
 585         $do_upstream_cpan = 1;
 586     }
 587     elsif ($arg eq '-deltas') {
 588         $do_deltas = 1;
 589     }
 590     elsif ($arg eq '-show_all') {
 591         $show_all = 1;
 592     }
 593     elsif ($arg eq '-counts') {
 594         $show_counts = 1;
 595     }
 596     elsif ($arg eq '-pedantic') {
 597         $pedantic = 1;
 598     }
 599     else {
 600         die <<EOF;
 601 Unknown option '$arg'
 602
  603 Usage: $0 [ --regen | --cpan | --show_all | FILE ... | --add_link MODULE ... ]
 604     --add_link -> Add the MODULE and man page references to the database
 605     --regen    -> Regenerate the data file for $0
 606     --cpan     -> Include files in the cpan subdirectory.
 607     --deltas   -> Include stable perldeltas
 608     --show_all -> Show all known potential problems
 609     --counts   -> Don't test, but give summary counts of the currently
 610                   existing database
 611     --pedantic -> Check for overly long lines in verbatim blocks
 612 EOF
 613     }
 614 }
 615
 616 $pedantic = 1 if exists $ENV{PERL_POD_PEDANTIC} and $ENV{PERL_POD_PEDANTIC};
 617 my @files = @ARGV;
 618
 619 my $cpan_or_deltas = $do_upstream_cpan || $do_deltas;
 620 if (($regen + $show_all + $show_counts + $add_link + $cpan_or_deltas ) > 1) {
  621     croak "--regen, --show_all, --counts, and --add_link are mutually exclusive\n and none can be run with --cpan or --deltas";
 622 }
 623
 624 my $has_input_files = @files;
 625
 626
 627 if ($add_link) {
 628     if (! $has_input_files) {
 629         croak "--add_link requires at least one module or man page reference";
 630     }
 631 }
 632 elsif ($has_input_files) {
 633     if ($regen || $show_counts || $do_upstream_cpan || $do_deltas) {
  634         croak "--regen, --counts, --deltas, and --cpan can't be used when specific files are given";
 635     }
 636     foreach my $file (@files) {
 637         croak "Can't read file '$file'" if ! -r $file;
 638     }
 639 }
 640
 641 our %problems;  # potential problems found in this run
 642
 643 package My::Pod::Checker {      # Extend Pod::Checker
 644     use parent 'Pod::Checker';
 645
 646     # Uses inside out hash to protect from typos
 647     # For new fields, remember to add to destructor DESTROY()
 648     my %CFL_text;           # The text comprising the current C<>, F<>, or L<>
 649     my %C_text;             # If defined, are in a C<> section, and includes
 650                             # the accumulated text from that
 651     my %current_indent;     # Current line's indent
  652     my %filename;           # The pod is stored in this file
 653     my %in_CFL;             # count of stacked C<>, F<>, L<> directives
 654     my %indents;            # Stack of indents from =over's in effect for
 655                             # current line
 656     my %in_for;             # true if in a =for or =begin
 657     my %in_NAME;            # true if within NAME section
 658     my %in_begin;           # true if within =begin section
 659     my %in_X;               # true if in a X<>
 660     my %linkable_item;      # Bool: if the latest =item is linkable.  It isn't
 661                             # for bullet and number lists
 662     my %linkable_nodes;     # Pod::Checker adds all =items to its node list,
 663                             # but not all =items are linkable to
 664     my %running_CFL_text;   # The current text that is being accumulated until
 665                             # an end_FOO is found, and this includes any C<>,
 666                             # F<>, or L<> directives.
  667     my %running_simple_text; # The current text that is being accumulated
 668                             # until an end_FOO is found, and all directives
 669                             # have been expanded into plain text
 670     my %command_count;      # Number of commands seen
 671     my %seen_pod_cmd;       # true if have =pod earlier
 672     my %skip;               # is SKIP set for this pod
  673     my %start_line;         # the first input line number in the thing
 674                             # currently being worked on
 675
 676     sub DESTROY {
 677         my $addr = Scalar::Util::refaddr $_[0];
 678         delete $CFL_text{$addr};
 679         delete $C_text{$addr};
 680         delete $command_count{$addr};
 681         delete $current_indent{$addr};
 682         delete $filename{$addr};
 683         delete $in_begin{$addr};
 684         delete $in_CFL{$addr};
 685         delete $indents{$addr};
 686         delete $in_for{$addr};
 687         delete $in_NAME{$addr};
 688         delete $in_X{$addr};
 689         delete $linkable_item{$addr};
 690         delete $linkable_nodes{$addr};
 691         delete $running_CFL_text{$addr};
 692         delete $running_simple_text{$addr};
 693         delete $seen_pod_cmd{$addr};
 694         delete $skip{$addr};
 695         delete $start_line{$addr};
 696         return;
 697     }
 698
 699     sub new {
 700         my $class = shift;
 701         my $filename = shift;
 702
 703         my $self = $class->SUPER::new(-quiet => 1,
 704                                      -warnings => $Warnings_Level);
 705         my $addr = Scalar::Util::refaddr $self;
 706         $command_count{$addr} = 0;
 707         $current_indent{$addr} = 0;
 708         $filename{$addr} = $filename;
 709         $in_begin{$addr} = 0;
 710         $in_X{$addr} = 0;
 711         $in_CFL{$addr} = 0;
 712         $in_NAME{$addr} = 0;
 713         $linkable_item{$addr} = 0;
 714         $seen_pod_cmd{$addr} = 0;
 715         return $self;
 716     }
 717
 718     # re's for messages that Pod::Checker outputs
 719     my $location = qr/ \b (?:in|at|on|near) \s+ /xi;
 720     my $optional_location = qr/ (?: $location )? /xi;
 721     my $line_reference = qr/ [('"]? $optional_location \b line \s+
 722                              (?: \d+ | EOF | \Q???\E | - )
 723                              [)'"]? /xi;
 724
 725     sub poderror {  # Called to register a potential problem
 726
 727         # This adds an extra field to the parent hash, 'parameter'.  It is
 728         # used to extract the variable parts of a message leaving just the
 729         # constant skeleton.  This in turn allows the message to be
 730         # categorized better, so that it shows up as a single type in our
 731         # database, with the specifics of each occurrence not being stored with
 732         # it.
 733
 734         my $self = shift;
 735         my $opts = shift;
 736
 737         my $addr = Scalar::Util::refaddr $self;
 738         return if $skip{$addr};
 739
 740         # Input can be a string or hash.  If a string, parse it to separate
 741         # out the line number and convert to a hash for easier further
 742         # processing
 743         my $message;
 744         if (ref $opts ne 'HASH') {
 745             $message = join "", $opts, @_;
 746             my $line_number;
 747             if ($message =~ s/\s*($line_reference)//) {
 748                 ($line_number = $1) =~ s/\s*$optional_location//;
 749             }
 750             else {
 751                 $line_number = '???';
 752             }
 753             $opts = { -msg => $message, -line => $line_number };
 754         } else {
 755             $message = $opts->{'-msg'};
 756
 757         }
 758
 759         $message =~ s/^\d+\s+//;
 760         return if main::suppressed($message);
 761
 762         $self->SUPER::poderror($opts, @_);
 763
 764         $opts->{parameter} = "" unless $opts->{parameter};
 765
 766         # The variable parts of the message tend to be enclosed in '...',
 767         # "....", or (...).  Extract them and put them in an extra field,
 768         # 'parameter'.  This is trickier because the matching delimiter to a
 769         # '(' is its mirror, and not itself.  Text::Balanced could be used
 770         # instead.
 771         while ($message =~ m/ \s* $optional_location ( [('"] )/xg) {
 772             my $delimiter = $1;
 773             my $start = $-[0];
 774             $delimiter = ')' if $delimiter eq '(';
 775
 776             # If there is no ending delimiter, don't consider it to be a
 777             # variable part.  Most likely it is a contraction like "Don't"
 778             last unless $message =~ m/\G .+? \Q$delimiter/xg;
 779
 780             my $length = $+[0] - $start;
 781
 782             # Get the part up through the closing delimiter
 783             my $special = substr($message, $start, $length);
 784             $special =~ s/^\s+//;   # No leading whitespace
 785
 786             # And add that variable part to the parameter, while removing it
 787             # from the message.  This isn't a foolproof way of finding the
 788             # variable part.  For example, '(s)' can occur in
 789             # 'paragraph(s)'.
 790             if ($special ne '(s)') {
 791                 substr($message, $start, $length) = "";
 792                 pos $message = $start;
 793                 $opts->{-msg} = $message;
 794                 $opts->{parameter} .= " " if $opts->{parameter};
 795                 $opts->{parameter} .= $special;
 796             }
 797         }
 798
 799         # Extract any additional line number given.  This is often the
 800         # beginning location of something whereas the main line number gives
 801         # the ending one.
 802         if ($message =~ /( $line_reference )/xi) {
 803             my $line_ref = $1;
 804             while ($message =~ s/\s*\Q$line_ref//) {
 805                 $opts->{-msg} = $message;
 806                 $opts->{parameter} .= " " if $opts->{parameter};
 807                 $opts->{parameter} .= $line_ref;
 808             }
 809         }
 810
 811         Carp::carp("Couldn't extract line number from '$message'") if $message =~ /line \d+/;
 812         push @{$problems{$filename{$addr}}{$message}}, $opts;
 813         #push @{$problems{$self->get_filename}{$message}}, $opts;
 814     }
 815
 816     # In the next subroutines, we keep track of the text of the current
 817     # innermost thing, like F<fooC<bar>baz>.  The things we care about raising
 818     # messages about in this program all come from a single sequence of
 819     # characters uninterrupted by other pod commands.  Therefore we don't have
 820     # to worry about recursion, and we can just set the string we care about
 821     # to empty on entrance to each command.
 822
 823     sub handle_text {
 824         # This is called by the parent class to deal with any straight text.
 825         # We mostly just append this to the running current value which will
 826         # be dealt with upon the end of the current construct, like a
 827         # paragraph.  But certain things don't contribute to checking the pod
 828         # and are ignored.  Flags set elsewhere indicate when this text is
 829         # going towards building certain constructs, and those are handled
 830         # specially.
 831
 832         my $self = shift;
 833         my $addr = Scalar::Util::refaddr $self;
 834
 835         my $return = $self->SUPER::handle_text(@_);
 836
 837         if ($in_X{$addr} || $in_for{$addr}) { # ignore
 838             return $return;
 839         }
 840
 841         my $text = join "\n", @_;
 842         $running_simple_text{$addr} .= $text;
 843
 844         # Keep separate tabs on C<>, F<>, and L<> directives, and one
 845         # especially for C<> ones.
 846         if ($in_CFL{$addr}) {
 847             $CFL_text{$addr} .= $text;
 848             $C_text{$addr} .= $text if defined $C_text{$addr};
 849         }
 850         else {
 851             # This variable is updated instead in the corresponding C, F, or L
 852             # handler.
 853             $running_CFL_text{$addr} .= $text;
 854         }
 855
 856         return $return;
 857     }
 858
 859     # The start_FOO routines check that somehow a C<> construct hasn't escaped
 860     # without being checked, and initialize things, and call the parent
 861     # class's equivalent routine.
 862
 863     # The end_FOO routines close things off, and check the text that has been
 864     # accumulated for FOO, then call the parent's corresponding routine.
 865
 866     sub start_Para {
 867         my $self = shift;
 868         check_see_but_not_link($self);
 869
 870         my $addr = Scalar::Util::refaddr $self;
 871         $start_line{$addr} = $_[0]->{start_line};
 872         $running_CFL_text{$addr} = "";
 873         $running_simple_text{$addr} = "";
 874         return $self->SUPER::start_Para(@_);
 875     }
 876
 877     sub start_item_text {
 878         my $self = shift;
 879         check_see_but_not_link($self);
 880
 881         my $addr = Scalar::Util::refaddr $self;
 882         $start_line{$addr} = $_[0]->{start_line};
 883         $running_CFL_text{$addr} = "";
 884         $running_simple_text{$addr} = "";
 885
 886         # This is the only =item that is linkable
 887         $linkable_item{$addr} = 1;
 888
 889         return $self->SUPER::start_item_text(@_);
 890     }
 891
 892     sub start_item_number {
 893         my $self = shift;
 894         check_see_but_not_link($self);
 895
 896         my $addr = Scalar::Util::refaddr $self;
 897         $start_line{$addr} = $_[0]->{start_line};
 898         $running_CFL_text{$addr} = "";
 899         $running_simple_text{$addr} = "";
 900
 901         return $self->SUPER::start_item_number(@_);
 902     }
 903
 904     sub start_item_bullet {
 905         my $self = shift;
 906         check_see_but_not_link($self);
 907
 908         my $addr = Scalar::Util::refaddr $self;
 909         $start_line{$addr} = $_[0]->{start_line};
 910         $running_CFL_text{$addr} = "";
 911         $running_simple_text{$addr} = "";
 912
 913         return $self->SUPER::start_item_bullet(@_);
 914     }
 915
 916     sub end_item {  # All =item types end the same way
 917         my $self = shift;
 918         check_see_but_not_link($self);
 919         return $self->SUPER::end_item(@_);
 920     }
 921
 922     sub start_over {
 923         my $self = shift;
 924         check_see_but_not_link($self);
 925
 926         my $addr = Scalar::Util::refaddr $self;
 927         $start_line{$addr} = $_[0]->{start_line};
 928         $running_CFL_text{$addr} = "";
 929         $running_simple_text{$addr} = "";
 930
 931         # Save this indent on a stack, and keep track of total indent
 932         my $indent =  $_[0]{'indent'};
 933         push @{$indents{$addr}}, $indent;
 934         $current_indent{$addr} += $indent;
 935
 936         return $self->SUPER::start_over(@_);
 937     }
 938
 939     sub end_over_bullet { shift->end_over(@_) }
 940     sub end_over_number { shift->end_over(@_) }
 941     sub end_over_text   { shift->end_over(@_) }
 942     sub end_over_block  { shift->end_over(@_) }
 943     sub end_over_empty  { shift->end_over(@_) }
 944     sub end_over {
 945         my $self = shift;
 946         check_see_but_not_link($self);
 947
 948         my $addr = Scalar::Util::refaddr $self;
 949
 950         # Pop current indent
 951         if (@{$indents{$addr}}) {
 952             $current_indent{$addr} -= pop @{$indents{$addr}};
 953         }
 954         else {
 955             # =back without corresponding =over, but should have
 956             # warned already
 957             $current_indent{$addr} = 0;
 958         }
 959     }
 960
 961     sub check_see_but_not_link {
 962
 963         # Looks through the text accumulated for the current element, which
 964         # retains its C<>, F<>, and L<> directives, for constructs like
 965         # "see C<link>" that look like they should be "see L<link>".
 966
 967         my $self = shift;
 968         my $addr = Scalar::Util::refaddr $self;
 969
 970         return unless defined $running_CFL_text{$addr};
 971
 972         while ($running_CFL_text{$addr} =~ m{
 973                                 ( (?: \w+ \s+ )* )  # The phrase before, if any
 974                                 \b [Ss]ee \s+
 975                                 ( ( [^L] )
 976                                   <
 977                                   ( [^<]*? )  # The not < excludes nested C<L<...
 978                                   >
 979                                 )
 980                                 ( \s+ (?: under | in ) \s+ L< )?
 981                             }xg)
 982         {
 983             my $prefix = $1 // "";
 984             my $construct = $2;     # The whole thing, like C<...>
 985             my $type = $3;
 986             my $interior = $4;
 987             my $trailing = $5;      # After the whole thing ending in "L<"
 988
 989             # If the full phrase is something like, "you might see C<", or
 990             # similar, it really isn't a reference to a link.  The ones I saw
 991             # all had the word "you" in them; and the "you" wasn't the
 992             # beginning of a sentence.
 993             if ($prefix !~ / \b you \b /x) {
 994
 995                 # Now, find what the module or man page name within the
 996                 # construct would be if it actually has L<> syntax.  If it
 997                 # doesn't have that syntax, will set the module to the entire
 998                 # interior.
 999                 if (! defined $trailing # not referring to something in another
1000                                         # section
1001                     && $interior !~ /$non_pods/
1002
1003                     # There can't be spaces (I think) in module names or man
1004                     # pages
1005                     && $interior !~ / \s /x
1006
1007                     # F<>s that end in, e.g., \.pl are almost certainly ok, as are
1008                     # those that look like a path with multiple "/" chars
1009                     && ($type ne "F"
1010                         || (! -e $interior
1011                             && $interior !~ /\.\w+$/
1012                             && $interior !~ /\/.+\//)
1013                     )
1014                 ) {
1015                     # TODO: move the checking of $pedantic higher up
1016                     $self->poderror({ -line => $start_line{$addr},
1017                         -msg => $C_not_linked,
1018                         parameter => $construct
1019                     });
1020                 }
1021             }
1022         }
1023
1024         undef $running_CFL_text{$addr};
1025     }
1026
1027     sub end_Para {
1028         my $self = shift;
1029         check_see_but_not_link($self);
1030
1031         my $addr = Scalar::Util::refaddr $self;
1032         if ($in_NAME{$addr}) {
1033             if ($running_simple_text{$addr} =~ /^\s*(\S+?)\s*$/) {
1034                 $self->poderror({ -line => $start_line{$addr},
1035                     -msg => $missing_name_description,
1036                     parameter => $1});
1037             }
1038             $in_NAME{$addr} = 0;
1039         }
1040         $self->SUPER::end_Para(@_);
1041     }
1042
1043     sub start_head1 {
1044         my $self = shift;
1045         check_see_but_not_link($self);
1046
1047         my $addr = Scalar::Util::refaddr $self;
1048         $start_line{$addr} = $_[0]->{start_line};
1049         $running_CFL_text{$addr} = "";
1050         $running_simple_text{$addr} = "";
1051
1052         return $self->SUPER::start_head1(@_);
1053     }
1054
1055     sub end_head1 {  # This is called at the end of the =head line.
1056         my $self = shift;
1057         check_see_but_not_link($self);
1058
1059         my $addr = Scalar::Util::refaddr $self;
1060
1061         $in_NAME{$addr} = 1 if $running_simple_text{$addr} eq 'NAME';
1062         return $self->SUPER::end_head(@_);
1063     }
1064
1065     sub start_Verbatim {
1066         my $self = shift;
1067         check_see_but_not_link($self);
1068
1069         my $addr = Scalar::Util::refaddr $self;
1070         $running_simple_text{$addr} = "";
1071         $start_line{$addr} = $_[0]->{start_line};
1072         return $self->SUPER::start_Verbatim(@_);
1073     }
1074
1075     sub end_Verbatim {
1076         my $self = shift;
1077         my $addr = Scalar::Util::refaddr $self;
1078
1079         # Pick up the name if it looks like one, since the parent class
1080         # doesn't handle verbatim NAMEs
1081         if ($in_NAME{$addr}
1082             && $running_simple_text{$addr} =~ /^\s*(\S+?)\s*[,-]/)
1083         {
1084             $self->name($1);
1085         }
1086
1087         my $indent = $self->get_current_indent;
1088
1089         # Look at each line to verify it is short enough
1090         my @lines = split /^/, $running_simple_text{$addr};
1091         for my $i (0 .. @lines - 1) {
1092             $lines[$i] =~ s/\s+$//;
1093             my $exceeds = length(Text::Tabs::expand($lines[$i]))
1094                         + $indent - $MAX_LINE_LENGTH;
1095             next unless $exceeds > 0;
1096
1097             $self->poderror({ -line => $start_line{$addr} + $i,
1098                 -msg => $line_length,
1099                 parameter => "+$exceeds (including " . ($indent - $INDENT) . " from =over's)",
1100             });
1101         }
1102
1103         undef $running_simple_text{$addr};
1104
1105         # Parent class didn't bother to define this
1106         #return $self->SUPER::SUPER::end_Verbatim(@_);
1107     }
1108
1109     sub start_C {
1110         my $self = shift;
1111         my $addr = Scalar::Util::refaddr $self;
1112
1113         $C_text{$addr} = "";
1114
1115         # If not in a stacked set of C<>, F<> and L<>, initialize the text for
1116         # them.
1117         $CFL_text{$addr} = "" if ! $in_CFL{$addr};
1118         $in_CFL{$addr}++;
1119
1120         return $self->SUPER::start_C(@_);
1121     }
1122
1123     sub start_F {
1124         my $self = shift;
1125         my $addr = Scalar::Util::refaddr $self;
1126
1127         $CFL_text{$addr} = "" if ! $in_CFL{$addr};
1128         $in_CFL{$addr}++;
1129         return $self->SUPER::start_F(@_);
1130     }
1131
1132     sub start_L {
1133         my $self = shift;
1134         my $addr = Scalar::Util::refaddr $self;
1135
1136         $CFL_text{$addr} = "" if ! $in_CFL{$addr};
1137         $in_CFL{$addr}++;
1138         return $self->SUPER::start_L(@_);
1139     }
1140
1141     sub end_C {
1142         my $self = shift;
1143         my $addr = Scalar::Util::refaddr $self;
1144
1145         # Warn if looks like a file or link enclosed instead by this C<>
1146         if ($C_text{$addr} =~ qr/^ $C_path_re $/x) {
1147             # Here it does look like it could be a file path or a link.
1148             # But some varieties of regex patterns could also fit with what we
1149             # have so far.  Weed those out as best we can.  '/foo/' is almost
1150             # certainly meant to be a pattern, as is '/foo/g'.
1151             my $is_pattern;
1152             if ($C_text{$addr} !~ qr| ^ / [^/]* / ( [msixpodualngcr]* ) $ |x) {
1153                 $is_pattern = 0;
1154             }
1155             else {
1156
1157                 # Here, it looks like a pattern potentially followed by some
1158                 # modifiers.  To make doubly sure, don't count as patterns
1159                 # those constructs which have more occurrences (generally 1)
1160                 # of a modifier than is legal.
1161                 my %counts;
1162                 $counts{$_}++ for split "", $1;
1163                 foreach my $modifier (keys %counts) {
1164                     if ($counts{$modifier} > (($modifier eq 'a')
1165                                               ? 2
1166                                               : 1))
1167                     {
1168                         $is_pattern = 0;
1169                         last;
1170                     }
1171                 }
1172                 $is_pattern = 1 unless defined $is_pattern;
1173             }
1174
1175             unless ($is_pattern) {
1176                 $self->poderror({ -line => $start_line{$addr},
1177                     -msg => $C_with_slash,
1178                     parameter => "C<$C_text{$addr}>"
1179                 });
1180             }
1181         }
1182         undef $C_text{$addr};
1183
1184         # Add the current text to the running total.  This was not done in
1185         # handle_text(), because it just sees the plain text of the innermost
1186         # stacked directive.  We want to keep all the directive names
1187         # enclosing the text.  Otherwise the fact that C<L<foobar>> is to a
1188         # link would be lost, as the L<> would be gone.
1189         $CFL_text{$addr} = "C<$CFL_text{$addr}>";
1190
1191         # Add this text to the whole running total only if popping this
1192         # directive off the stack leaves it empty.  As long as something is on
1193         # the stack, it gets added to $CFL_text (just above).  It is only
1194         # entirely constructed when the stack is empty.
1195         $in_CFL{$addr}--;
1196         $running_CFL_text{$addr} .= $CFL_text{$addr} if ! $in_CFL{$addr};
1197
1198         return $self->SUPER::end_C(@_);
1199     }
1200
1201     sub end_F {
1202         my $self = shift;
1203         my $addr = Scalar::Util::refaddr $self;
1204
1205         $CFL_text{$addr} = "F<$CFL_text{$addr}>";
1206         $in_CFL{$addr}--;
1207         $running_CFL_text{$addr} .= $CFL_text{$addr} if ! $in_CFL{$addr};
1208         return $self->SUPER::end_F(@_);
1209     }
1210
1211     sub end_L {
1212         my $self = shift;
1213         my $addr = Scalar::Util::refaddr $self;
1214
1215         $CFL_text{$addr} = "L<$CFL_text{$addr}>";
1216         $in_CFL{$addr}--;
1217         $running_CFL_text{$addr} .= $CFL_text{$addr} if ! $in_CFL{$addr};
1218         return $self->SUPER::end_L(@_);
1219     }
1220
1221     sub start_X {
1222         my $self = shift;
1223         my $addr = Scalar::Util::refaddr $self;
1224
1225         $in_X{$addr} = 1;
1226         return $self->SUPER::start_X(@_);
1227     }
1228
1229     sub end_X {
1230         my $self = shift;
1231         my $addr = Scalar::Util::refaddr $self;
1232
1233         $in_X{$addr} = 0;
1234         return $self->SUPER::end_X(@_);
1235     }
1236
1237     sub start_for {
1238         my $self = shift;
1239         my $addr = Scalar::Util::refaddr $self;
1240
1241         $in_for{$addr} = 1;
1242         return $self->SUPER::start_for(@_);
1243     }
1244
1245     sub end_for {
1246         my $self = shift;
1247         my $addr = Scalar::Util::refaddr $self;
1248
1249         $in_for{$addr} = 0;
1250         return $self->SUPER::end_for(@_);
1251     }
1252
1253     sub hyperlink {
1254         my ($self, $link) = @_;
1255
1256         if ($link && $link->type eq 'pod') {
1257             my $page = $link->page;
1258             my $node = $link->node;
1259
1260             # If the hyperlink is to an interior node of another page, save it
1261             # so that we can see if we need to parse normally skipped files.
1262             $has_referred_to_node{$page} = 1 if $node;
1263
1264             # Ignore certain placeholder links in perldelta.  Check if the
1265             # link is page-level, and also check if to a node within the page
1266             if (   $self->name && $self->name eq "perldelta"
1267                 && ((  grep { $page eq $_ } @perldelta_ignore_links)
1268                     || (   $node
1269                         && (grep { "$page/$node" eq $_ } @perldelta_ignore_links)
1270             ))) {
1271                 return;
1272             }
1273         }
1274
1275         return $self->SUPER::hyperlink($link);
1276     }
1277
1278     sub node {
1279         my $self = shift;
1280         my $text = $_[0];
1281         if($text) {
1282             $text =~ s/\s+$//s; # strip trailing whitespace
1283             $text =~ s/\s+/ /gs; # collapse whitespace
1284             my $addr = Scalar::Util::refaddr $self;
1285             push(@{$linkable_nodes{$addr}}, $text) if
1286                                     ! $current_indent{$addr}
1287                                     || $linkable_item{$addr};
1288         }
1289         return $self->SUPER::node($_[0]);
1290     }
1291
1292     sub get_current_indent {
1293         return $INDENT + $current_indent{Scalar::Util::refaddr $_[0]};
1294     }
1295
1296     sub get_filename {
1297         return $filename{Scalar::Util::refaddr $_[0]};
1298     }
1299
1300     sub linkable_nodes {
1301         my $linkables = $linkable_nodes{Scalar::Util::refaddr $_[0]};
1302         return undef unless $linkables;
1303         return @$linkables;
1304     }
1305
1306     sub get_skip {
1307         return $skip{Scalar::Util::refaddr $_[0]} // 0;
1308     }
1309
1310     sub set_skip {
1311         my $self = shift;
1312         $skip{Scalar::Util::refaddr $self} = shift;
1313
1314         # If skipping, no need to keep the problems for it
1315         delete $problems{$self->get_filename};
1316         return;
1317     }
1318
1319     sub parse_from_file {
1320         # This overrides the super class method so that if an open fails on a
1321         # transitory file, it doesn't croak.  It returns 1 if it did find the
1322         # file, 0 if it didn't
1323
1324         my $self = shift;
1325         my $filename = shift;
1326         # ignores 2nd param, which is output file.  Always uses undef
1327
1328         if (open my $in_fh, '<:bytes', $filename) {
1329             $self->SUPER::parse_from_file($in_fh, undef);
1330             close $in_fh;
1331             return 1;
1332         }
1333
1334         # If couldn't open file, perhaps it was transitory, and hence not an error
1335         return 0 unless -e $filename;
1336
1337         die "Can't open '$filename': $!\n";
1338     }
1339 }
1340
1341 package Tie_Array_to_FH {  # So printing actually goes to an array
1342
1343     my %array;
1344
1345     sub TIEHANDLE {
1346         my $class = shift;
1347         my $array_ref = shift;
1348
1349         my $self = bless \do{ my $anonymous_scalar }, $class;
1350         $array{Scalar::Util::refaddr $self} = $array_ref;
1351
1352         return $self;
1353     }
1354
1355     sub PRINT {
1356         my $self = shift;
1357         push @{$array{Scalar::Util::refaddr $self}}, @_;
1358         return 1;
1359     }
1360 }
1361
1362
1363 my %filename_to_checker; # Map a filename to its pod checker object
1364 my %id_to_checker;       # Map a checksum to its pod checker object
1365 my %nodes;               # key is filename, values are nodes in that file.
1366 my %nodes_first_word;    # same, but value is first word of each node
1367 my %valid_modules;       # List of modules known to exist outside us.
1368 my %digests;             # checksums of files, whose names are the keys
1369 my %filename_to_pod;     # Map a filename to its pod NAME
1370 my %files_with_unknown_issues;
1371 my %files_with_fixes;
1372
1373 my $data_fh;
1374 open $data_fh, '<:bytes', $known_issues or die "Can't open $known_issues: $!";
1375
1376 my %counts; # For --counts param, count of each issue type
1377 my %suppressed_files;   # Files with at least one issue type to suppress
1378 my $HEADER = <<END;
1379 # This file is the data file for $0.
1380 # There are three types of lines.
1381 # Comment lines are white-space only or begin with a '#', like this one.  Any
1382 #   changes you make to the comment lines will be lost when the file is
1383 #   regen'd.
1384 # Lines without tab characters are simply NAMES of pods that the program knows
1385 #   will have links to them and the program does not check if those links are
1386 #   valid.
1387 # All other lines should have three fields, each separated by a tab.  The
1388 #   first field is the name of a pod; the second field is an error message
1389 #   generated by this program; and the third field is a count of how many
1390 #   known instances of that message there are in the pod.  -1 means that the
1391 #   program can expect any number of this type of message.
1392 END
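     # Illustration only: a hypothetical three-field entry (the pod name,
     # message, and count here are invented) and how the reading loop below
     # would split it on tabs:
     #     my ($filename, $message, $count)
     #         = split /\t/, "pod/perlexample.pod\tSome message text\t2";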
1393
1394 my @existing_issues;
1395
1396
1397 while (<$data_fh>) {    # Read the database
1398     chomp;
1399     next if /^\s*(?:#|$)/;  # Skip comment and empty lines
1400     if (/\t/) {
1401         next if $show_all;
1402         if ($add_link) {    # The issues are saved and later output unchanged
1403             push @existing_issues, $_;
1404             next;
1405         }
1406
1407         # Keep track of counts of each issue type for each file
1408         my ($filename, $message, $count) = split /\t/;
1409         $known_problems{$filename}{$message} = $count;
1410
1411         if ($show_counts) {
1412             if ($count < 0) {   # -1 means to suppress this issue type
1413                 $suppressed_files{$filename} = $filename;
1414             }
1415             else {
1416                 $counts{$message} += $count;
1417             }
1418         }
1419     }
1420     else {  # Lines without a tab are modules known to be valid
1421         $valid_modules{$_} = 1
1422     }
1423 }
1424 close $data_fh;
1425
1426 if ($add_link) {
1427     $copy_fh = open_new($known_issues);
1428
1429     # Check for basic sanity, and add each command line argument
1430     foreach my $module (@files) {
1431         die "\"$module\" does not look like a module or man page"
1432             # Must look like A, A::B, A::B::C, ..., or foo(3C)
1433             if $module !~ /^ (?: \w+ (?: :: \w+ )* | \w+ \( \d \w* \) ) $/x;
1434         $valid_modules{$module} = 1
1435     }
1436     my_safer_print($copy_fh, $HEADER);
1437     foreach (sort { lc $a cmp lc $b } keys %valid_modules) {
1438         my_safer_print($copy_fh, $_, "\n");
1439     }
1440
1441     # The rest of the db file is output unchanged.
1442     my_safer_print($copy_fh, join "\n", @existing_issues, "");
1443
1444     close_and_rename($copy_fh);
1445     exit;
1446 }
1447
1448 if ($show_counts) {
1449     my $total = 0;
1450     foreach my $message (sort keys %counts) {
1451         $total += $counts{$message};
1452         note(Text::Tabs::expand("$counts{$message}\t$message"));
1453     }
1454     note("-----\n" . Text::Tabs::expand("$total\tknown potential issues"));
1455     if (%suppressed_files) {
1456         note("\nFiles that have all messages of at least one type suppressed:");
1457         note(join ",", keys %suppressed_files);
1458     }
1459     exit 0;
1460 }
1461
1462 # re to match files that are to be parsed only if there is an internal link
1463 # to them.  It does not include cpan, as whether those are parsed depends
1464 # on a switch.  Currently, only perltoc and the stable perldelta.pod's
1465 # are included.  The latter all have characters between 'perl' and
1466 # 'delta'.  (Actually the one currently under development matches as well,
1467 # but it is a duplicate of perldelta.pod, so it can be skipped, and it is
1468 # fine for it to match this.)
1469 my $only_for_interior_links_re = qr/ ^ pod\/perltoc\.pod $
1470                                    /x;
1471 unless ($do_deltas) {
1472     $only_for_interior_links_re = qr/$only_for_interior_links_re |
1473                                     \b perl \d+ delta \. pod \b
1474                                 /x;
1475 }
1476
1477 { # Closure
1478     my $first_time = 1;
1479
1480     sub output_thanks ($$$$) {  # Called when an issue has been fixed
1481         my $filename = shift;
1482         my $original_count = shift;
1483         my $current_count = shift;
1484         my $message = shift;
1485
1486         $files_with_fixes{$filename} = 1;
1487         my $return;
1488         my $fixed_count = $original_count - $current_count;
1489         my $a_problem = ($fixed_count == 1) ? "a problem" : "multiple problems";
1490         my $another_problem = ($fixed_count == 1) ? "another problem" : "another set of problems";
1491         my $diff;
1492         if ($message) {
1493             $diff = <<EOF;
1494 There were $original_count occurrences (now $current_count) in this pod of type
1495 "$message",
1496 EOF
1497         } else {
1498             $diff = <<EOF;
1499 There are no longer any problems found in this pod!
1500 EOF
1501         }
1502
1503         if ($first_time) {
1504             $first_time = 0;
1505             $return = <<EOF;
1506 Thanks for fixing $a_problem!
1507 $diff
1508 Now you must teach $0 that this was fixed.
1509 EOF
1510         }
1511         else {
1512             $return = <<EOF;
1513 Thanks for fixing $another_problem.
1514 $diff
1515 EOF
1516         }
1517
1518         return $return;
1519     }
1520 }
1521
1522 sub my_safer_print {    # print, with error checking for outputting to db
1523     my ($fh, @lines) = @_;
1524
1525     if (! print $fh @lines) {
1526         my $save_error = $!;
1527         close($fh);
1528         die "Write failure: $save_error";
1529     }
1530 }
1531
1532 sub extract_pod {   # Extracts just the pod from a file; returns undef if file
1533                     # doesn't exist
1534     my $filename = shift;
1535     use Pod::Parser;
1536
1537     my @pod;
1538
1539     # Arrange for the output of Pod::Parser to be collected in an array we can
1540     # look at instead of being printed
1541     tie *ALREADY_FH, 'Tie_Array_to_FH', \@pod;
1542     if (open my $in_fh, '<:bytes', $filename) {
1543         my $parser = Pod::Parser->new();
1544         $parser->parse_from_filehandle($in_fh, *ALREADY_FH);
1545         close $in_fh;
1546
1547         return join "", @pod
1548     }
1549
1550     # The file should already have been opened once to get here, so if that
1551     # fails, something is wrong.  It's possible that a transitory file
1552     # containing a pod would get here, so if the file no longer exists just
1553     # return undef.
1554     return unless -e $filename;
1555     die "Can't open '$filename': $!\n";
1556 }
1557
1558 my $digest = Digest->new($digest_type);
1559
1560 # This is used as a callback from File::Find::find(), which always constructs
1561 # pathnames using Unix separators
1562 sub is_pod_file {
1563     # If $_ is a pod file, add it to the lists and do other prep work.
1564
1565     if (-d) {
1566         # Don't look at files in directories that are for tests, nor those
1567         # beginning with a dot
1568         if (m!/t\z! || m!/\.!) {
1569             $File::Find::prune = 1;
1570         }
1571         return;
1572     }
1573
1574     return unless -r && -s;    # Can't check it if can't read it; no need to
1575                                # check if 0 length
1576     return unless -f || -l;    # Weird file types won't be pods
1577
1578     my ($leaf) = m!([^/]+)\z!;
1579     if (m!/\.!                 # No hidden Unix files
1580         || $leaf =~ $non_pods) {
1581         note("Not considering $_") if DEBUG;
1582         return;
1583     }
1584
1585     my $filename = $File::Find::name;
1586
1587     # $filename is relative, like './path'.  Strip that initial part away.
1588     $filename =~ s!^\./!! or die "Unexpected pathname '$filename'";
1589
1590     return if $excluded_files{canonicalize($filename)};
1591
1592     my $contents = do {
1593         local $/;
1594         my $candidate;
1595         if (! open $candidate, '<:bytes', $_) {
1596
1597             # If a transitory file was found earlier, the open could fail
1598             # legitimately and we just skip the file; also skip it if it is a
1599             # broken symbolic link, as it is probably just a build problem;
1600             # certainly not a file that we would want to check the pod of.
1601             # Otherwise fail it here and no reason to process it further.
1602             # (But the test count will be off too)
1603             ok(0, "Can't open '$filename': $!")
1604                                             if -r $filename && ! -l $filename;
1605             return;
1606         }
1607         <$candidate>;
1608     };
1609
1610     # If the file is a .pm or .pod, having any initial '=' on a line is
1611     # grounds for testing it.  Otherwise, require a head1 NAME line to
1612     # consider it as a potential pod
1613     if ($filename =~ /\.(?:pm|pod)/) {
1614         return unless $contents =~ /^=/m;
1615     } else {
1616         return unless $contents =~ /^=head1 +NAME/m;
1617     }
1618
1619     # Here, we know that the file is a pod.  Add it to the list of files
1620     # to check and create a checker object for it.
1621
1622     push @files, $filename;
1623     my $checker = My::Pod::Checker->new($filename);
1624     $filename_to_checker{$filename} = $checker;
1625
1626     # In order to detect duplicate pods and only analyze them once, we
1627     # compute checksums for the file, so don't have to do an exact
1628     # compare.  Note that if the pod is just part of the file, the
1629     # checksums can differ for the same pod.  That special case is handled
1630     # later, since if the checksums of the whole file are the same, that
1631     # case won't even come up.  We don't need the checksums for files that
1632     # we parse only if there is a link to its interior, but we do need its
1633     # NAME, which is also retrieved in the code below.
1634
1635     if ($filename =~ / (?: ^(cpan|lib|ext|dist)\/ )
1636                         | $only_for_interior_links_re
1637                     /x) {
1638         $digest->add($contents);
1639         $digests{$filename} = $digest->digest;
1640
1641         # lib files aren't analyzed if they are duplicates of files copied
1642         # there from some other directory.  But to determine this, we need
1643         # to know their NAMEs.  We might as well find the NAME now while
1644         # the file is open.  Similarly, cpan files aren't analyzed unless
1645         # we're analyzing all of them, or this particular file is linked
1646         # to by a file we are analyzing, and thus we will want to verify
1647         # that the target exists in it.  We need to know at least the NAME
1648         # to see if it's worth analyzing, or so we can determine if a lib
1649         # file is a copy of a cpan one.
1650         if ($filename =~ m{ (?: ^ (?: cpan | lib ) / )
1651                             | $only_for_interior_links_re
1652                             }x) {
1653             if ($contents =~ /^=head1 +NAME.*/mg) {
1654                 # The NAME is the first non-spaces on the line up to a
1655                 # comma, dash or end of line.  Otherwise, it's invalid and
1656                 # this pod doesn't have a legal name that we're smart
1657                 # enough to find currently.  But the parser will later
1658                 # find it if it thinks there is a legal name, and set the
1659                 # name
1660                 if ($contents =~ /\G    # continue from the line after =head1
1661                                   \s*   # ignore any empty lines
1662
1663                                   # ignore =for paragraphs followed by empty
1664                                   # lines
1665                                   (?: ^ =for .*? \n (?: [^\s]*? \n )* \s* )*
1666
1667                                   ^ \s* ( \S+?) \s* (?: [,-] | $ )/mx) {
1668                     my $name = $1;
1669                     $checker->name($name);
1670                     $id_to_checker{$name} = $checker
1671                         if $filename =~ m{^cpan/};
1672                 }
1673             }
1674             elsif ($filename =~ m{^cpan/}) {
1675                 $id_to_checker{$digests{$filename}} = $checker;
1676             }
1677         }
1678     }
1679
1680     return;
1681 } # End of is_pod_file()
1682
1683 # Start of real code that isn't processing the command line (except the
1684 # db is read in above, as is processing of the --add_link option).
1685 # Here, @files contains the list of files on the command line.  If there are
1686 # any, unconditionally test them, and show all the errors, even the known
1687 # ones, and, since not testing other pods, don't do cross-pod link tests.
1688 # (Could add extra code to do cross-pod tests for the ones in the list.)
1689
1690 if ($has_input_files) {
1691     undef %known_problems;
1692     $do_upstream_cpan = $do_deltas = 1;  # In case one of the inputs is one
1693                                          # of these types
1694 }
1695 else { # No input files -- go find all the possibilities.
1696     if ($regen) {
1697         $copy_fh = open_new($known_issues);
1698         note("Regenerating $known_issues, please be patient...");
1699         print $copy_fh $HEADER;
1700     }
1701
1702     # Move to the directory above us, but have to adjust @INC to account for
1703     # that.
1704     s{^\.\./lib$}{lib} for @INC;
1705     chdir File::Spec->updir;
1706
1707     # And look in this directory and all its subdirectories
1708     find( {wanted => \&is_pod_file, no_chdir => 1}, '.');
1709
1710     # Add ourselves to the test
1711     push @files, "t/porting/podcheck.t";
1712 }
1713
1714 # Now we know how many tests there will be.
1715 plan (tests => scalar @files) if ! $regen;
1716
1717
1718 # Sort file names so we get consistent results, and to put cpan last,
1719 # preceded by the ones that we don't generally parse.  This is because both
1720 # these classes are generally parsed only if there is a link to the interior
1721 # of them, and we have to parse all others first to guarantee that they don't
1722 # have such a link. 'lib' files come just before these, as some of these are
1723 # duplicates of others.  We already have figured this out when gathering the
1724 # data as a special case for all such files, but this, while unnecessary,
1725 # puts the derived file last in the output.  'readme' files come before those,
1726 # as those also could be duplicates of others, which are considered the
1727 # primary ones.  These currently aren't figured out when gathering data, so
1728 # are done here.
1729 @files = sort { if ($a =~ /^cpan/) {
1730                    return 1 if $b !~ /^cpan/;
1731                    return lc $a cmp lc $b;
1732                }
1733                elsif ($b =~ /^cpan/) {
1734                    return -1;
1735                }
1736                elsif ($a =~ /$only_for_interior_links_re/) {
1737                    return 1 if $b !~ /$only_for_interior_links_re/;
1738                    return lc $a cmp lc $b;
1739                }
1740                elsif ($b =~ /$only_for_interior_links_re/) {
1741                    return -1;
1742                }
1743                elsif ($a =~ /^lib/) {
1744                    return 1 if $b !~ /^lib/;
1745                    return lc $a cmp lc $b;
1746                }
1747                elsif ($b =~ /^lib/) {
1748                    return -1;
1749                } elsif ($a =~ /\breadme\b/i) {
1750                    return 1 if $b !~ /\breadme\b/i;
1751                    return lc $a cmp lc $b;
1752                }
1753                elsif ($b =~ /\breadme\b/i) {
1754                    return -1;
1755                }
1756                else {
1757                    return lc $a cmp lc $b;
1758                }
1759            }
1760            @files;
1761
1762 # Now go through all the files and parse them
1763 FILE:
1764 foreach my $filename (@files) {
1765     my $parsed = 0;
1766     note("parsing $filename") if DEBUG;
1767
1768     # We may have already figured out some things in the process of generating
1769     # the file list.  If so, we have a $checker object already.  But if not,
1770     # generate one now.
1771     my $checker = $filename_to_checker{$filename};
1772     if (! $checker) {
1773         $checker = My::Pod::Checker->new($filename);
1774         $filename_to_checker{$filename} = $checker;
1775     }
1776
1777     # We have set the name in the checker object if there is a possibility
1778     # that no further parsing is necessary, but otherwise do the parsing now.
1779     if (! $checker->name) {
1780         if (! $checker->parse_from_file($filename, undef)) {
1781             $checker->set_skip("$filename is transitory");
1782             next FILE;
1783         }
1784         $parsed = 1;
1785
1786     }
1787
1788     if ($checker->num_errors() < 0) {   # Returns negative if not a pod
1789         $checker->set_skip("$filename is not a pod");
1790     }
1791     else {
1792
1793         # Here, is a pod.  See if it is one that has already been tested,
1794         # or should be tested under another directory.  Use either its NAME
1795         # if it has one, or a checksum if not.
1796         my $name = $checker->name;
1797         my $id;
1798
1799         if ($name) {
1800             $id = $name;
1801         }
1802         else {
1803             my $digest = Digest->new($digest_type);
1804             my $contents = extract_pod($filename);
1805
1806             # If the return is undef, it means that $filename was a transitory
1807             # file; skip it.
1808             next FILE unless defined $contents;
1809             $digest->add($contents);
1810             $id = $digest->digest;
1811         }
1812
1813         # If there is a match for this pod with something that we've already
1814         # processed, don't process it, and output why.
1815         my $prior_checker;
1816         if (defined ($prior_checker = $id_to_checker{$id})
1817             && $prior_checker != $checker)  # Could have defined the checker
1818                                             # earlier without pursuing it
1819         {
1820
1821             # If the pods are identical, then it's just a copy, and isn't an
1822             # error.  First use the checksums we have already computed to see
1823             # if the entire files are identical, which means that the pods are
1824             # identical too.
1825             my $prior_filename = $prior_checker->get_filename;
1826             my $same = (! $name
1827                         || ($digests{$prior_filename}
1828                             && $digests{$filename}
1829                             && $digests{$prior_filename} eq $digests{$filename}));
1830
1831             # If they differ, it could be that the files differ for some
1832             # reason, but the pods they contain are identical.  Extract the
1833             # pods and do the comparisons on just those.
1834             if (! $same && $name) {
1835                 my $contents = extract_pod($filename);
1836
1837                 # If return is <undef>, it means that $filename no longer
1838                 # exists.  This means it was a transitory file, and should not
1839                 # be tested.
1840                 next FILE unless defined $contents;
1841
1842                 my $prior_contents = extract_pod($prior_filename);
1843
1844                 # If return is <undef>, it means that $prior_filename no
1845                 # longer exists.  This means it was a transitory file, and
1846                 # should not have been tested, but we already did process it.
1847                 # What we should do now is to back-out its records, and
1848                 # process $filename in its stead.  But backing out is not so
1849                 # simple, and so I'm (khw) skipping that unless and until
1850                 # experience shows that it is needed.  We do go process
1851                 # $filename, and there are potential false positive conflicts
1852                 # with the transitory $prior_contents, and rerunning the test
1853                 # should cause it to succeed.
1854                 goto process_this_pod unless defined $prior_contents;
1855
1856                 $same = $prior_contents eq $contents;
1857             }
1858
1859             use File::Basename 'basename';
1860             if ($same) {
1861                 $checker->set_skip("The pod of $filename is a duplicate of "
1862                                     . "the pod for $prior_filename");
1863             } elsif ($prior_filename =~ /\breadme\b/i) {
1864                 $checker->set_skip("$prior_filename is a README apparently for $filename");
1865             } elsif ($filename =~ /\breadme\b/i) {
1866                 $checker->set_skip("$filename is a README apparently for $prior_filename");
1867             } elsif (! $do_upstream_cpan
1868                      && $filename =~ /^cpan/
1869                      && $prior_filename =~ /^cpan/)
1870             {
1871                 $checker->set_skip("CPAN is upstream for $filename");
1872             } elsif ( $filename =~ /^utils/ or $prior_filename =~ /^utils/ ) {
1873                 $checker->set_skip("$filename copy is in utils/");
1874             } elsif ($prior_filename =~ /^(?:cpan|ext|dist)/
1875                      && $filename !~ /^(?:cpan|ext|dist)/
1876                      && basename($prior_filename) eq basename($filename))
1877             {
1878                 $checker->set_skip("$filename: Need to run make?");
1879             } else { # Here have two pods with identical names that differ
1880                 $prior_checker->poderror(
1881                         { -msg => $duplicate_name,
1882                             -line => "???",
1883                             parameter => "'$filename' also has NAME '$name'"
1884                         });
1885                 $checker->poderror(
1886                     { -msg => $duplicate_name,
1887                         -line => "???",
1888                         parameter => "'$prior_filename' also has NAME '$name'"
1889                     });
1890
1891                 # Changing the names helps later.
1892                 $prior_checker->name("$name version arbitrarily numbered 1");
1893                 $checker->name("$name version arbitrarily numbered 2");
1894             }
1895
1896             # In any event, don't process this pod that has the same name as
1897             # another.
1898             next FILE;
1899         }
1900
1901     process_this_pod:
1902
1903         # A unique pod.
1904         $id_to_checker{$id} = $checker;
1905
1906         my $parsed_for_links = ", but parsed for its interior links";
1907         if ((! $do_upstream_cpan && $filename =~ /^cpan/)
1908              || $filename =~ $only_for_interior_links_re)
1909         {
1910             if ($filename =~ /^cpan/) {
1911                 $checker->set_skip("CPAN is upstream for $filename");
1912             }
1913             elsif ($filename =~ /perl\d+delta/) {
1914                 if (! $do_deltas) {
1915                     $checker->set_skip("$filename is a stable perldelta");
1916                 }
1917             }
1918             elsif ($filename =~ /perltoc/) {
1919                 $checker->set_skip("$filename dependent on component pods");
1920             }
1921             else {
1922                 croak("Unexpected file '$filename' encountered that has parsing for interior-linking only");
1923             }
1924
1925             if ($name && $has_referred_to_node{$name}) {
1926                 $checker->set_skip($checker->get_skip() . $parsed_for_links);
1927             }
1928         }
1929
1930         # Need a name in order to process it, because not meaningful
1931         # otherwise, and also can't test links to this without a name.
1932         if (!defined $name) {
1933             $checker->poderror( { -msg => $no_name,
1934                                   -line => '???'
1935                                 });
1936             next FILE;
1937         }
1938
1939         # For skipped files, just get its NAME
1940         my $skip;
1941         if (($skip = $checker->get_skip()) && $skip !~ /$parsed_for_links/)
1942         {
1943             $checker->node($name) if $name;
1944         }
1945         elsif (! $parsed) {
1946             if (! $checker->parse_from_file($filename, undef)) {
1947                 $checker->set_skip("$filename is transitory");
1948                 next FILE;
1949             }
1950         }
1951
1952         # Go through everything in the file that could be an anchor that
1953         # could be a link target.  Count how many there are of the same name.
1954         foreach my $node ($checker->linkable_nodes) {
1955             next FILE if ! $node;        # Can be empty, as for '=item *'
1956             $nodes{$name}{$node}++;
1957
1958             # Experiments have shown that cpan search can figure out the
1959             # target of a link even if the exact wording is incorrect, as long
1960             # as the first word is.  This happens frequently in perlfunc.pod,
1961             # where the link will be just to the function, but the target
1962             # entry also includes parameters to the function.
1963             my $first_word = $node;
1964             if ($first_word =~ s/^(\S+)\s+\S.*/$1/) {
1965                 $nodes_first_word{$name}{$first_word} = $node;
1966             }
1967         }
1968         $filename_to_pod{$filename} = $name;
1969     }
1970 }
1971
1972 # Here, all files have been parsed, and all links and link targets are stored.
1973 # Now go through the files again and see which don't have matches.
1974 if (! $has_input_files) {
1975     foreach my $filename (@files) {
1976         next if $filename_to_checker{$filename}->get_skip;
1977
1978         my $checker = $filename_to_checker{$filename};
1979         foreach my $link ($checker->hyperlinks()) {
1980             my $linked_to_page = $link->page;
1981             next unless $linked_to_page;   # intra-file checks are handled by std
1982                                            # Pod::Checker
1983             # Currently, we assume all external links are valid
1984             next if $link->type eq 'url';
1985
1986             # Initialize the potential message.
1987             my %problem = ( -msg => $broken_link,
1988                             -line => $link->line,
1989                             parameter => "to \"$linked_to_page\"",
1990                         );
1991
1992             # See if we have found the linked-to file in our parse
1993             if (exists $nodes{$linked_to_page}) {
1994                 my $node = $link->node;
1995
1996                 # If link is only to the page-level, already have it
1997                 next if ! $node;
1998
1999                 # If link is to a node that exists in the file, is ok
2000                 if ($nodes{$linked_to_page}{$node}) {
2001
2002                     # But if the page has multiple targets with the same name,
2003                     # it's ambiguous which one this should be to.
2004                     if ($nodes{$linked_to_page}{$node} > 1) {
2005                         $problem{-msg} = $multiple_targets;
2006                         $problem{parameter} = "in $linked_to_page that $node could be pointing to";
2007                         $checker->poderror(\%problem);
2008                     }
2009                 } elsif (! $nodes_first_word{$linked_to_page}{$node}) {
2010
2011                     # Here the link target was not found, either exactly or to
2012                     # the first word.  Is an error.
2013                     $problem{parameter} =~ s,"$,/$node",;
2014                     $checker->poderror(\%problem);
2015                 }
2016
2017             } # Linked-to-file not in parse; maybe is in exception list
2018             elsif (! exists $valid_modules{$link->page}) {
2019
2020                 # Here, is a link to a target that we can't find.  Check if
2021                 # there is an internal link on the page with the target name.
2022                 # If so, it could be that they just forgot the initial '/'
2023                 # But perldelta is handled specially: only do this if the
2024                 # broken link isn't one of the known bad ones (that are
2025                 # placemarkers and should be removed for the final)
2026                 my $NAME = $filename_to_pod{$filename};
2027                 if (! defined $NAME) {
2028                     $checker->poderror(\%problem);
2029                 }
2030                 else {
2031                     if ($nodes{$NAME}{$linked_to_page}) {
2032                         $problem{-msg} = $broken_internal_link;
2033                     }
2034                     $checker->poderror(\%problem);
2035                 }
2036             }
2037         }
2038     }
2039 }
2040
2041 # If regenerating the data file, start with the modules for which we don't
2042 # check targets.  If you change the sort order, you need to run --regen before
2043 # committing so that future commits that do run regen don't show irrelevant
2044 # changes.
2045 if ($regen) {
2046     foreach (sort { lc $a cmp lc $b } keys %valid_modules) {
2047         my_safer_print($copy_fh, $_, "\n");
2048     }
2049 }
2050
2051 # Now ready to output the messages.
2052 foreach my $filename (@files) {
2053     my $canonical = canonicalize($filename);
2054     SKIP: {
2055         my $skip = $filename_to_checker{$filename}->get_skip // "";
2056
2057         if ($regen) {
2058             foreach my $message ( sort keys %{$problems{$filename}}) {
2059                 my $count;
2060
2061                 # Preserve a negative setting.
2062                 if ($known_problems{$canonical}{$message}
2063                     && $known_problems{$canonical}{$message} < 0)
2064                 {
2065                     $count = $known_problems{$canonical}{$message};
2066                 }
2067                 else {
2068                     $count = @{$problems{$filename}{$message}};
2069                 }
2070                 my_safer_print($copy_fh, $canonical . "\t$message\t$count\n");
2071             }
2072             next;
2073         }
2074
2075         skip($skip, 1) if $skip;
2076         my @diagnostics;
2077         my $thankful_diagnostics = 0;
2078         my $indent = '  ';
2079
2080         my $total_known = 0;
2081         foreach my $message ( sort keys %{$problems{$filename}}) {
2082             $known_problems{$canonical}{$message} = 0
2083                                     if ! $known_problems{$canonical}{$message};
2084             my $diagnostic = "";
2085             my $problem_count = scalar @{$problems{$filename}{$message}};
2086             $total_known += $problem_count;
2087             next if $known_problems{$canonical}{$message} < 0;
2088             if ($problem_count > $known_problems{$canonical}{$message}) {
2089
2090                 # Here we are about to output all the messages for this type,
2091                 # subtract back this number we previously added in.
2092                 $total_known -= $problem_count;
2093
2094                 $diagnostic .= $indent . qq{"$message"};
2095                 if ($problem_count > 2) {
2096                     $diagnostic .= "  ($problem_count occurrences,"
2097                         . " expected $known_problems{$canonical}{$message})";
2098                 }
2099                 foreach my $problem (@{$problems{$filename}{$message}}) {
2100                     $diagnostic .= " " if $problem_count == 1;
2101                     $diagnostic .= "\n$indent$indent";
2102                     $diagnostic .= "$problem->{parameter}" if $problem->{parameter};
2103                     $diagnostic .= " near line $problem->{-line} of "
2104                                    . $filename;
2105                     $diagnostic .= " $problem->{comment}" if $problem->{comment};
2106                 }
2107                 $diagnostic .= "\n";
2108                 $files_with_unknown_issues{$filename} = 1;
2109             } elsif ($problem_count < $known_problems{$canonical}{$message}) {
2110                 $diagnostic = output_thanks($filename, $known_problems{$canonical}{$message}, $problem_count, $message);
2111                 $thankful_diagnostics++;
2112             }
2113             push @diagnostics, $diagnostic if $diagnostic;
2114         }
2115
2116         # The above loop has output messages where there are current potential
2117         # issues.  But it misses where there were some that have been entirely
2118         # fixed.  For those, we need to look through the old issues.
2119         foreach my $message ( sort keys %{$known_problems{$canonical}}) {
2120             next if $problems{$filename}{$message};
2121             next if ! $known_problems{$canonical}{$message};
2122             next if $known_problems{$canonical}{$message} < 0; # Preserve negs
2123
2124             next if !$pedantic and $message =~
2125                 /^(?:\Q$line_length\E|\Q$C_not_linked\E|\Q$C_with_slash\E)/;
2126
2127             my $diagnostic = output_thanks($filename, $known_problems{$canonical}{$message}, 0, $message);
2128             push @diagnostics, $diagnostic if $diagnostic;
2129             $thankful_diagnostics++ if $diagnostic;
2130         }
2131
2132         my $output = "POD of $filename";
2133         $output .= ", excluding $total_known not shown known potential problems"
2134                                                                 if $total_known;
2135         if (@diagnostics && @diagnostics == $thankful_diagnostics) {
2136             # Output fixed issues as passing to-do tests, so they do not
2137             # cause failures, but t/harness still flags them.
2138             $output .= " # TODO";
2139         }
2140         ok(@diagnostics == $thankful_diagnostics, $output);
2141         if (@diagnostics) {
2142             diag(join "", @diagnostics,
2143             "See end of this test output for your options on silencing this");
2144         }
2145
2146         delete $known_problems{$canonical};
2147     }
2148 }
2149
2150 if (! $regen
2151     && ! ok (keys %known_problems == 0, "The known problems database ($data_dir/known_pod_issues.dat) includes no references to non-existent files"))
2152 {
2153     note("The following files were not found: "
2154          . join ", ", keys %known_problems);
2155     note("They will automatically be removed from the db the next time");
2156     note("  cd t; ./perl -I../lib porting/podcheck.t --regen");
2157     note("is run");
2158 }
2159
2160 my $how_to = <<EOF;
2161    run this test script by hand, using the following formula (on
2162    Un*x-like machines):
2163         cd t
2164         ./perl -I../lib porting/podcheck.t --regen
2165 EOF
2166
2167 if (%files_with_unknown_issues) {
2168     my $were_count_files = scalar keys %files_with_unknown_issues;
2169     $were_count_files = ($were_count_files == 1)
2170                         ? "was $were_count_files file"
2171                         : "were $were_count_files files";
2172     my $message = <<EOF;
2173
2174 HOW TO GET ${\__FILE__} TO PASS
2175
2176 There $were_count_files that had new potential problems identified.
2177 Some of them may be real, and some of them may be false positives because
2178 this program isn't as smart as it likes to think it is.  You can teach this
2179 program to ignore the issues it has identified, and hence pass, by doing the
2180 following:
2181
2182 1) If a problem is about a link to an unknown module or man page that
2183    you know exists, re-run the command like this:
2184       ./perl -I../lib porting/podcheck.t --add_link MODULE man_page ...
2185    (MODULEs should look like Foo::Bar, and man_pages should look like
2186    bar(3c); don't do this for a module or man page that you aren't sure
2187    about; instead treat as another type of issue and follow the
2188    instructions below.)
2189
2190 2) For other issues, decide if each should be fixed now or not.  Fix the
2191    ones you decided to, and rerun this test to verify that the fixes
2192    worked.
2193
2194 3) If there remain false positives or problems that you don't plan to fix right
2195    now,
2196 $how_to
2197    That should cause all current potential problems to be accepted by
2198    the program, so that the next time it runs, they won't be flagged.
2199 EOF
2200     if (%files_with_fixes) {
2201         $message .= "   This step will also take care of the files that have fixes in them\n";
2202     }
2203
2204     $message .= <<EOF;
2205    For a few files, such as perltoc, certain issues will always be
2206    expected, and more of the same will be added over time.  For those,
2207    before you do the regen, you can edit
2208    $known_issues
2209    and find the entry for the module's file and specific error message,
2210    and change the count of known potential problems to -1.
2211 EOF
2212
2213     diag($message);
2214 } elsif (%files_with_fixes) {
2215     diag(<<EOF
2216 To teach this test script that the potential problems have been fixed,
2217 $how_to
2218 EOF
2219     );
2220 }
2221
2222 if ($regen) {
2223     chdir $original_dir or die "Can't change directories to $original_dir: $!";
2224     close_and_rename($copy_fh);
2225 }
2226
2227 1;