This is a live mirror of the Perl 5 development currently hosted at https://github.com/perl/perl5
utf8.c: pod clarification
[perl5.git] / t / porting / podcheck.t
CommitLineData
a67b1afa 1#!/usr/bin/perl -w
f7b649f0 2
faeaf995
FC
3BEGIN {
4 chdir 't';
5 unshift @INC, "../lib";
6}
7
a67b1afa 8use strict;
16384ac1
KW
9use warnings;
10use feature 'unicode_strings';
11
12use Carp;
de4da25d 13use Config;
16384ac1
KW
14use Digest;
15use File::Find;
16use File::Spec;
17use Scalar::Util;
18use Text::Tabs;
19
20BEGIN {
21 require '../regen/regen_lib.pl';
22}
23
24sub DEBUG { 0 };
25
26=pod
27
28=head1 NAME
29
30podcheck.t - Look for possible problems in the Perl pods
31
32=head1 SYNOPSIS
33
34 cd t
d9e2eb4b
KW
35 ./perl -I../lib porting/podcheck.t [--show_all] [--cpan] [--deltas]
36 [--counts] [ FILE ...]
477100f8
KW
37 ./perl -I../lib porting/podcheck.t --add_link MODULE ...
38
16384ac1
KW
39 ./perl -I../lib porting/podcheck.t --regen
40
41=head1 DESCRIPTION
42
43podcheck.t is an extension of Pod::Checker. It looks for pod errors and
44potential errors in the files given as arguments, or if none specified, in all
bc20e6b8
KW
45pods in the distribution workspace, except certain known special ones
46(specified below). It does additional checking beyond that done by
16384ac1
KW
47Pod::Checker, and keeps a database of known potential problems, and will
48fail a pod only if the number of such problems differs from that given in the
49database. It also suppresses the C<(section) deprecated> message from
50Pod::Checker, since specifying the man page section number is quite proper to do.
51
52The additional checks it makes are:
53
54=over
55
56=item Cross-pod link checking
57
58Pod::Checker verifies that links to an internal target in a pod are not
59broken. podcheck.t extends that (when called without FILE arguments) to
60external links. It does this by gathering up all the possible targets in the
477100f8
KW
61workspace, and cross-checking them. It also checks that a non-broken link
62points to just one target. (The destination pod could have two targets with
63the same name.)
64
65The way that the C<LE<lt>E<gt>> pod command works (for links outside the pod)
66is to actually create a link to C<search.cpan.org> with an embedded query for
67the desired pod or man page. That means that links outside the distribution
68are valid. podcheck.t doesn't verify the validity of such links, but instead
69keeps a data base of those known to be valid. This means that if a link to a
70target not on the list is created, the target needs to be added to the data
71base. This is accomplished via the L<--add_link|/--add_link MODULE ...>
72option to podcheck.t, described below.
16384ac1
KW
73
74=item An internal link that isn't so specified
75
76If a link is broken, but there is an existing internal target of the same
77name, it is likely that the internal target was meant, and the C<"/"> is
78missing from the C<LE<lt>E<gt>> pod command.
79
b879da0f 80=item Verbatim paragraphs that wrap in an 80 (including 1 spare) column window
16384ac1
KW
81
82It's annoying to have lines wrap when displaying pod documentation in a
b879da0f
KW
83terminal window. This checks that all verbatim lines fit in a standard 80
84column window, even when using a pager that reserves a column for its own use.
85(Thus the check is for a net of 79 columns.)
5b1cac40 86For those lines that don't fit, it tells you how much needs to be cut in
b879da0f 87order to fit.
16384ac1
KW
88
89Often, the easiest thing to do to gain space for these is to lower the indent
90to just one space.
91
92=item Missing or duplicate NAME or missing NAME short description
93
94A pod can't be linked to unless it has a unique name.
95And a NAME should have a dash and short description after it.
96
97=item =encoding statement issues
98
99This indicates if an C<=encoding> statement should be present, or moved to the
100front of the pod.
101
102=item Items that perhaps should be links
103
104There are mentions of apparent files in the pods that perhaps should be links
105instead, using C<LE<lt>...E<gt>>
106
107=item Items that perhaps should be C<FE<lt>...E<gt>>
108
109What look like path names enclosed in C<CE<lt>...E<gt>> should perhaps have
110C<FE<lt>...E<gt>> mark-up instead.
111
112=back
113
114A number of issues raised by podcheck.t and by the base Pod::Checker are not
bc20e6b8
KW
115really problems, but merely potential problems, that is, false positives.
116After inspecting them and
16384ac1 117deciding that they aren't real problems, it is possible to shut up this program
204c2383
KW
118about them, unlike base Pod::Checker. For a valid link to an outside module
119or man page, call podcheck.t with the C<--add_link> option to add it to the
120the database of known links; for other causes, call podcheck.t with the C<--regen>
121option to regenerate the entire database. This tells it that all existing
16384ac1
KW
122issues are to not be mentioned again.
123
204c2383 124C<--regen> isn't fool-proof. The database merely keeps track of the number of these
16384ac1
KW
125potential problems of each type for each pod. If a new problem of a given
126type is introduced into the pod, podcheck.t will spit out all of them. You
127then have to figure out which is the new one, and should it be changed or not.
128But doing it this way insulates the database from having to keep track of line
129numbers of problems, which may change, or the exact wording of each problem
130which might also change without affecting whether it is a problem or not.
131
132Also, if the count of potential problems of a given type for a pod decreases,
133the database must be regenerated so that it knows the new number. The program
134gives instructions when this happens.
135
bc20e6b8
KW
136Some pods will have varying numbers of problems of a given type. This can
137be handled by manually editing the database file (see L</FILES>), and setting
138the number of those problems for that pod to a negative number. This will
139cause the corresponding error to always be suppressed no matter how many there
140actually are.
141
204c2383
KW
142Another problem is that there is currently no check that modules listed as
143valid in the data base
16384ac1
KW
144actually are. Thus any errors introduced there will remain there.
145
bc20e6b8
KW
146=head2 Specially handled pods
147
148=over
149
150=item perltoc
151
152This pod is generated by pasting bits from other pods. Errors in those bits
153will show up as errors here, as well as for those other pods. Therefore
154errors here are suppressed, and the pod is checked only to verify that nodes
204c2383 155within it actually exist that are externally linked to.
bc20e6b8
KW
156
157=item perldelta
158
159The current perldelta pod is initialized from a template that contains
160placeholder text. Some of this text is in the form of links that don't really
161exist. Any such links that are listed in C<@perldelta_ignore_links> will not
162generate messages. It is presumed that these links will be cleaned up when
163the perldelta is cleaned up for release since they should be marked with
164C<XXX>.
165
166=item Porting/perldelta_template.pod
167
168This is not a pod, but a template for C<perldelta>. Any errors introduced
169here will show up when C<perldelta> is created from it.
170
171=item cpan-upstream pods
172
173See the L</--cpan> option documentation
174
175=item old perldeltas
176
177See the L</--deltas> option documentation
178
179=back
180
16384ac1
KW
181=head1 OPTIONS
182
183=over
184
477100f8
KW
185=item --add_link MODULE ...
186
187Use this option to teach podcheck.t that the C<MODULE>s or man pages actually
188exist, and to silence any messages that links to them are broken.
189
190podcheck.t checks that links within the Perl core distribution are valid, but
191it doesn't check links to man pages or external modules. When it finds
192a broken link, it checks its data base of external modules and man pages,
193and only if not found there does it raise a message. This option just adds
194the list of modules and man page references that follow it on the command line
195to that data base.
196
197For example,
198
199 cd t
243a655d 200 ./perl -I../lib porting/podcheck.t --add_link Unicode::Casing
477100f8
KW
201
202causes the external module "Unicode::Casing" to be added to the data base, so
12b4b03c 203C<LE<lt>Unicode::CasingE<gt>> will be considered valid.
477100f8 204
16384ac1
KW
205=item --regen
206
207Regenerate the data base used by podcheck.t to include all the existing
208potential problems. Future runs of the program will not then flag any of
209these.
210
211=item --cpan
212
213Normally, all pods in the cpan directory are skipped, except to make sure that
214any blead-upstream links to such pods are valid.
bc20e6b8 215This option will cause cpan upstream pods to be fully checked.
16384ac1 216
d9e2eb4b
KW
217=item --deltas
218
219Normally, all old perldelta pods are skipped, except to make sure that
220any links to such pods are valid. This is because they are considered
221stable, and perhaps trying to fix them will cause changes that will
bc20e6b8
KW
222misrepresent Perl's history. But, this option will cause them to be fully
223checked.
d9e2eb4b 224
16384ac1
KW
225=item --show_all
226
227Normally, if the number of potential problems of a given type found for a
228pod matches the expected value in the database, they will not be displayed.
229This option forces the database to be ignored during the run, so all potential
230problems are displayed and will fail their respective pod test. Specifying
231any particular FILES to operate on automatically selects this option.
232
233=item --counts
234
235Instead of testing, this just dumps the counts of the occurrences of the
236various types of potential problems in the data base.
237
238=back
239
240=head1 FILES
241
242The database is stored in F<t/porting/known_pod_issues.dat>
243
244=head1 SEE ALSO
245
246L<Pod::Checker>
247
248=cut
249
ae54198b
KW
250# VMS builds have a '.com' appended to utility and script names, and it adds a
251# trailing dot for any other file name that doesn't have a dot in it. The db
252# is stored without those things. This regex allows for these special file
253# names to be dealt with. It needs to be interpolated into a larger regex
254# that furnishes the closing boundary.
255my $vms_re = qr/ \. (?: com )? /x;
256
257# Some filenames in the MANIFEST match $vms_re, and so must not be handled the
258# same way that that the special vms ones are. This hash lists those.
259my %special_vms_files;
260
21b63541
KW
261# This is to get this to work across multiple file systems, including those
262# that are not case sensitive. The db is stored in lower case, Un*x style,
263# and all file name comparisons are done that way.
264sub canonicalize($) {
265 my $input = shift;
266 my ($volume, $directories, $file)
267 = File::Spec->splitpath(File::Spec->canonpath($input));
268 # Assumes $volume is constant for everything in this directory structure
269 $directories = "" if ! $directories;
270 $file = "" if ! $file;
ae54198b
KW
271 $file = lc join '/', File::Spec->splitdir($directories), $file;
272 $file =~ s! / /+ !/!gx; # Multiple slashes => single slash
273
274 # The db is stored without the special suffixes that are there in VMS, so
275 # strip them off to get the comparable name. But some files on all
276 # platforms have these suffixes, so this shouldn't happen for them, as any
277 # of their db entries will have the suffixes in them. The hash has been
278 # populated with these files.
279 if ($^O eq 'VMS'
280 && $file =~ / ( $vms_re ) $ /x
281 && ! exists $special_vms_files{$file})
282 {
283 $file =~ s/ $1 $ //x;
284 }
285 return $file;
21b63541
KW
286}
287
16384ac1
KW
288#####################################################
289# HOW IT WORKS (in general)
290#
291# If not called with specific files to check, the directory structure is
292# examined for files that have pods in them. Files that might not have to be
293# fully parsed (e.g. in cpan) are parsed enough at this time to find their
294# pod's NAME, and to get a checksum.
295#
296# Those kinds of files are sorted last, but otherwise the pods are parsed with
297# the package coded here, My::Pod::Checker, which is an extension to
298# Pod::Checker that adds some tests and suppresses others that aren't
299# appropriate. The latter module has no provision for capturing diagnostics,
300# so a package, Tie_Array_to_FH, is used to force them to be placed into an
301# array instead of printed.
302#
303# Parsing the files builds up a list of links. The files are gone through
304# again, doing cross-link checking and outputting all saved-up problems with
305# each pod.
306#
307# Sorting the files last that potentially don't need to be fully parsed allows
308# us to not parse them unless there is a link to an internal anchor in them
309# from something that we have already parsed. Keeping checksums allows us to
310# not parse copies of other pods.
311#
312#####################################################
313
314# 1 => Exclude low priority messages that aren't likely to be problems, and
315# has many false positives; higher numbers give more messages.
316my $Warnings_Level = 200;
317
b9b741c8
KW
318# perldelta during construction may have place holder links. N.B. This
319# variable is referred to by name in release_managers_guide.pod
2feebde0 320our @perldelta_ignore_links = ( "XXX", "perl5YYYdelta", "perldiag/message" );
1c01047d 321
16384ac1
KW
322# To see if two pods with the same NAME are actually copies of the same pod,
323# which is not an error, it uses a checksum to save work.
324my $digest_type = "SHA-1";
325
326my $original_dir = File::Spec->rel2abs(File::Spec->curdir);
327my $data_dir = File::Spec->catdir($original_dir, 'porting');
328my $known_issues = File::Spec->catfile($data_dir, 'known_pod_issues.dat');
c3e757a4 329my $MANIFEST = File::Spec->catfile(File::Spec->updir($original_dir), 'MANIFEST');
16384ac1
KW
330my $copy_fh;
331
b879da0f 332my $MAX_LINE_LENGTH = 79; # 79 columns
46b134f4 333my $INDENT = 7; # default nroff indent
16384ac1
KW
334
335# Our warning messages. Better not have [('"] in them, as those are used as
336# delimiters for variable parts of the messages by poderror.
337my $line_length = "Verbatim line length including indents exceeds $MAX_LINE_LENGTH by";
338my $broken_link = "Apparent broken link";
339my $broken_internal_link = "Apparent internal link is missing its forward slash";
340my $see_not_linked = "? Should you be using L<...> instead of";
341my $C_with_slash = "? Should you be using F<...> or maybe L<...> instead of";
342my $multiple_targets = "There is more than one target";
343my $duplicate_name = "Pod NAME already used";
344my $need_encoding = "Should have =encoding statement because have non-ASCII";
345my $encoding_first = "=encoding must be first command (if present)";
346my $no_name = "There is no NAME";
347my $missing_name_description = "The NAME should have a dash and short description after it";
348
7f8d58fb 349# objects, tests, etc can't be pods, so don't look for them. Also skip
a71e7b2c
KW
350# files output by the patch program. Could also ignore most of .gitignore
351# files, but not all, so don't.
de4da25d
CB
352
353my $obj_ext = $Config{'obj_ext'}; $obj_ext =~ tr/.//d; # dot will be added back
354my $lib_ext = $Config{'lib_ext'}; $lib_ext =~ tr/.//d;
355my $lib_so = $Config{'so'}; $lib_so =~ tr/.//d;
356my $dl_ext = $Config{'dlext'}; $dl_ext =~ tr/.//d;
357
6981f6e4
KW
358# Not really pods, but can look like them.
359my %excluded_files = (
c841a192 360 canonicalize("lib/unicore/mktables") => 1,
a9df55d5 361 canonicalize("Porting/make-rmg-checklist") => 1,
c841a192 362 canonicalize("Porting/perldelta_template.pod") => 1,
e233f0c5 363 canonicalize("regen/feature.pl") => 1,
c841a192
KW
364 canonicalize("autodoc.pl") => 1,
365 canonicalize("configpm") => 1,
366 canonicalize("miniperl") => 1,
367 canonicalize("perl") => 1,
00e518b3
FR
368 canonicalize('cpan/Pod-Perldoc/corpus/no-head.pod') => 1,
369 canonicalize('cpan/Pod-Perldoc/corpus/perlfunc.pod') => 1,
370 canonicalize('cpan/Pod-Perldoc/corpus/utf8.pod') => 1,
9c3bdfaf 371 canonicalize("lib/unicore/mktables") => 1,
6981f6e4
KW
372 );
373
df80274d
KW
374# This list should not include anything for which case sensitivity is
375# important, as it won't work on VMS, and won't show up until tested on VMS.
c3e757a4
KW
376# All or almost all such files should be listed in the MANIFEST, so that can
377# be examined for them, and each such file explicitly excluded, as is done for
378# .PL files in the loop just below this. For files not catchable this way,
379# is_pod_file() can be used to exclude these at a finer grained level.
a71e7b2c 380my $non_pods = qr/ (?: \.
df80274d 381 (?: [achot] | zip | gz | bz2 | jar | tar | tgz
a71e7b2c
KW
382 | orig | rej | patch # Patch program output
383 | sw[op] | \#.* # Editor droppings
a65bcb92 384 | old # buildtoc output
de4da25d
CB
385 | xs # pod should be in the .pm file
386 | al # autosplit files
387 | bs # bootstrap files
388 | (?i:sh) # shell scripts, hints, templates
389 | lst # assorted listing files
390 | bat # Windows,Netware,OS2 batch files
391 | cmd # Windows,Netware,OS2 command files
392 | lis # VMS compiler listings
393 | map # VMS linker maps
394 | opt # VMS linker options files
395 | mms # MM(K|S) description files
396 | ts # timestamp files generated during build
397 | $obj_ext # object files
398 | exe # $Config{'exe_ext'} might be empty string
399 | $lib_ext # object libraries
400 | $lib_so # shared libraries
401 | $dl_ext # dynamic libraries
a71e7b2c
KW
402 )
403 $
e0a28d18 404 ) | ~$ | \ \(Autosaved\)\.txt$ # Other editor droppings
de4da25d
CB
405 | ^cxx\$demangler_db\.$ # VMS name mangler database
406 | ^typemap\.?$ # typemap files
407 | ^(?i:Makefile\.PL)$
a71e7b2c 408 /x;
16384ac1 409
c3e757a4
KW
410# '.PL' files should be excluded, as they aren't final pods, but often contain
411# material used in generating pods, and so can look like a pod. We can't use
412# the regexp above because case sensisitivity is important for these, as some
413# '.pl' files should be examined for pods. Instead look through the MANIFEST
414# for .PL files and get their full path names, so we can exclude each such
415# file explicitly. This works because other porting tests prohibit having two
416# files with the same names except for case.
417open my $manifest_fh, '<:bytes', $MANIFEST or die "Can't open $MANIFEST";
418while (<$manifest_fh>) {
ae54198b
KW
419
420 # While we have MANIFEST open, on VMS platforms, look for files that match
421 # the magic VMS file names that have to be handled specially. Add these
422 # to the list of them.
423 if ($^O eq 'VMS' && / ^ ( [^\t]* $vms_re ) \t /x) {
424 $special_vms_files{$1} = 1;
425 }
c3e757a4
KW
426 if (/ ^ ( [^\t]* \. PL ) \t /x) {
427 $excluded_files{canonicalize($1)} = 1;
428 }
429}
430close $manifest_fh, or die "Can't close $MANIFEST";
431
16384ac1
KW
432
433# Pod::Checker messages to suppress
434my @suppressed_messages = (
435 "(section) in", # Checker is wrong to flag this
436 "multiple occurrence of link target", # We catch independently the ones
437 # that are real problems.
438 "unescaped <>",
19f4a855 439 "Entity number out of range", # Checker outputs this for anything above
8505a15d 440 # 255, but in fact all Unicode is valid
16384ac1
KW
441);
442
443sub suppressed {
444 # Returns bool as to if input message is one that is to be suppressed
445
446 my $message = shift;
447 return grep { $message =~ /^\Q$_/i } @suppressed_messages;
448}
449
450{ # Closure to contain a simple subset of test.pl. This is to get rid of the
451 # unnecessary 'failed at' messages that would otherwise be output pointing
452 # to a particular line in this file.
a67b1afa 453
16384ac1
KW
454 my $current_test = 0;
455 my $planned;
456
457 sub plan {
458 my %plan = @_;
459 $planned = $plan{tests};
460 print "1..$planned\n";
461 return;
462 }
463
464 sub ok {
465 my $success = shift;
466 my $message = shift;
467
468 chomp $message;
469
470 $current_test++;
471 print "not " unless $success;
472 print "ok $current_test - $message\n";
473 return;
474 }
475
476 sub skip {
477 my $why = shift;
478 my $n = @_ ? shift : 1;
479 for (1..$n) {
480 $current_test++;
481 print "ok $current_test # skip $why\n";
482 }
483 no warnings 'exiting';
484 last SKIP;
485 }
486
487 sub note {
488 my $message = shift;
489
490 chomp $message;
491
492 print $message =~ s/^/# /mgr;
493 print "\n";
494 return;
495 }
496
497 END {
498 if ($planned && $planned != $current_test) {
499 print STDERR
500 "# Looks like you planned $planned tests but ran $current_test.\n";
501 }
502 }
503}
504
16384ac1
KW
505# List of known potential problems by pod and type.
506my %known_problems;
507
508# Pods given by the keys contain an interior node that is referred to from
509# outside it.
510my %has_referred_to_node;
511
512my $show_counts = 0;
513my $regen = 0;
477100f8 514my $add_link = 0;
16384ac1
KW
515my $show_all = 0;
516
d9e2eb4b
KW
517my $do_upstream_cpan = 0; # Assume that are to skip anything in /cpan
518my $do_deltas = 0; # And stable perldeltas
16384ac1
KW
519
520while (@ARGV && substr($ARGV[0], 0, 1) eq '-') {
521 my $arg = shift @ARGV;
522
523 $arg =~ s/^--/-/; # Treat '--' the same as a single '-'
524 if ($arg eq '-regen') {
525 $regen = 1;
526 }
477100f8
KW
527 elsif ($arg eq '-add_link') {
528 $add_link = 1;
529 }
16384ac1
KW
530 elsif ($arg eq '-cpan') {
531 $do_upstream_cpan = 1;
532 }
d9e2eb4b
KW
533 elsif ($arg eq '-deltas') {
534 $do_deltas = 1;
535 }
16384ac1
KW
536 elsif ($arg eq '-show_all') {
537 $show_all = 1;
538 }
539 elsif ($arg eq '-counts') {
540 $show_counts = 1;
541 }
542 else {
543 die <<EOF;
544Unknown option '$arg'
545
477100f8
KW
546Usage: $0 [ --regen | --cpan | --show_all | FILE ... | --add_link MODULE ... ]\n"
547 --add_link -> Add the MODULE and man page references to the data base
16384ac1 548 --regen -> Regenerate the data file for $0
477100f8 549 --cpan -> Include files in the cpan subdirectory.
d9e2eb4b 550 --deltas -> Include stable perldeltas
16384ac1
KW
551 --show_all -> Show all known potential problems
552 --counts -> Don't test, but give summary counts of the currently
553 existing database
554EOF
555 }
556}
557
558my @files = @ARGV;
559
d9e2eb4b
KW
560my $cpan_or_deltas = $do_upstream_cpan || $do_deltas;
561if (($regen + $show_all + $show_counts + $add_link + $cpan_or_deltas ) > 1) {
562 croak "--regen, --show_all, --counts, and --add_link are mutually exclusive\n and none can be run with --cpan nor --deltas";
16384ac1
KW
563}
564
565my $has_input_files = @files;
566
d9e2eb4b
KW
567if ($has_input_files
568 && ($regen || $show_counts || $do_upstream_cpan || $do_deltas))
569{
570 croak "--regen, --counts, --deltas, and --cpan can't be used since using specific files";
16384ac1
KW
571}
572
477100f8
KW
573if ($add_link && ! $has_input_files) {
574 croak "--add_link requires at least one module or man page reference";
575}
576
16384ac1
KW
577our %problems; # potential problems found in this run
578
579package My::Pod::Checker { # Extend Pod::Checker
a67b1afa
MM
580 use parent 'Pod::Checker';
581
16384ac1
KW
582 # Uses inside out hash to protect from typos
583 # For new fields, remember to add to destructor DESTROY()
584 my %indents; # Stack of indents from =over's in effect for
585 # current line
586 my %current_indent; # Current line's indent
587 my %filename; # The pod is store in this file
588 my %skip; # is SKIP set for this pod
589 my %in_NAME; # true if within NAME section
590 my %in_begin; # true if within =begin section
591 my %linkable_item; # Bool: if the latest =item is linkable. It isn't
592 # for bullet and number lists
593 my %linkable_nodes; # Pod::Checker adds all =items to its node list,
594 # but not all =items are linkable to
595 my %seen_encoding_cmd; # true if have =encoding earlier
596 my %command_count; # Number of commands seen
597 my %seen_pod_cmd; # true if have =pod earlier
598 my %warned_encoding; # true if already have warned about =encoding
599 # problems
600
601 sub DESTROY {
602 my $addr = Scalar::Util::refaddr $_[0];
603 delete $command_count{$addr};
604 delete $current_indent{$addr};
605 delete $filename{$addr};
606 delete $in_begin{$addr};
607 delete $indents{$addr};
608 delete $in_NAME{$addr};
609 delete $linkable_item{$addr};
610 delete $linkable_nodes{$addr};
611 delete $seen_encoding_cmd{$addr};
612 delete $seen_pod_cmd{$addr};
613 delete $skip{$addr};
614 delete $warned_encoding{$addr};
615 return;
616 }
617
618 sub new {
619 my $class = shift;
620 my $filename = shift;
621
622 my $self = $class->SUPER::new(-quiet => 1,
623 -warnings => $Warnings_Level);
624 my $addr = Scalar::Util::refaddr $self;
625 $command_count{$addr} = 0;
626 $current_indent{$addr} = 0;
627 $filename{$addr} = $filename;
628 $in_begin{$addr} = 0;
629 $in_NAME{$addr} = 0;
630 $linkable_item{$addr} = 0;
631 $seen_encoding_cmd{$addr} = 0;
632 $seen_pod_cmd{$addr} = 0;
633 $warned_encoding{$addr} = 0;
634 return $self;
635 }
636
637 # re's for messages that Pod::Checker outputs
638 my $location = qr/ \b (?:in|at|on|near) \s+ /xi;
639 my $optional_location = qr/ (?: $location )? /xi;
3cb68c65 640 my $line_reference = qr/ [('"]? $optional_location \b line \s+
16384ac1
KW
641 (?: \d+ | EOF | \Q???\E | - )
642 [)'"]? /xi;
643
644 sub poderror { # Called to register a potential problem
645
646 # This adds an extra field to the parent hash, 'parameter'. It is
647 # used to extract the variable parts of a message leaving just the
648 # constant skeleton. This in turn allows the message to be
649 # categorized better, so that it shows up as a single type in our
650 # database, with the specifics of each occurrence not being stored with
651 # it.
652
653 my $self = shift;
654 my $opts = shift;
655
656 my $addr = Scalar::Util::refaddr $self;
657 return if $skip{$addr};
658
659 # Input can be a string or hash. If a string, parse it to separate
660 # out the line number and convert to a hash for easier further
661 # processing
662 my $message;
663 if (ref $opts ne 'HASH') {
664 $message = join "", $opts, @_;
665 my $line_number;
666 if ($message =~ s/\s*($line_reference)//) {
667 ($line_number = $1) =~ s/\s*$optional_location//;
668 }
669 else {
670 $line_number = '???';
671 }
672 $opts = { -msg => $message, -line => $line_number };
673 } else {
674 $message = $opts->{'-msg'};
675
676 }
677
678 $message =~ s/^\d+\s+//;
679 return if main::suppressed($message);
680
681 $self->SUPER::poderror($opts, @_);
682
683 $opts->{parameter} = "" unless $opts->{parameter};
684
685 # The variable parts of the message tend to be enclosed in '...',
686 # "....", or (...). Extract them and put them in an extra field,
687 # 'parameter'. This is trickier because the matching delimiter to a
688 # '(' is its mirror, and not itself. Text::Balanced could be used
689 # instead.
690 while ($message =~ m/ \s* $optional_location ( [('"] )/xg) {
691 my $delimiter = $1;
692 my $start = $-[0];
693 $delimiter = ')' if $delimiter eq '(';
694
695 # If there is no ending delimiter, don't consider it to be a
696 # variable part. Most likely it is a contraction like "Don't"
697 last unless $message =~ m/\G .+? \Q$delimiter/xg;
698
699 my $length = $+[0] - $start;
700
701 # Get the part up through the closing delimiter
702 my $special = substr($message, $start, $length);
703 $special =~ s/^\s+//; # No leading whitespace
704
705 # And add that variable part to the parameter, while removing it
706 # from the message. This isn't a foolproof way of finding the
707 # variable part. For example '(s)' can occur in e.g.,
708 # 'paragraph(s)'
709 if ($special ne '(s)') {
710 substr($message, $start, $length) = "";
711 pos $message = $start;
712 $opts->{-msg} = $message;
713 $opts->{parameter} .= " " if $opts->{parameter};
714 $opts->{parameter} .= $special;
715 }
716 }
717
718 # Extract any additional line number given. This is often the
719 # beginning location of something whereas the main line number gives
720 # the ending one.
721 if ($message =~ /( $line_reference )/xi) {
722 my $line_ref = $1;
723 while ($message =~ s/\s*\Q$line_ref//) {
724 $opts->{-msg} = $message;
725 $opts->{parameter} .= " " if $opts->{parameter};
726 $opts->{parameter} .= $line_ref;
727 }
728 }
729
b3fdb838 730 Carp::carp("Couldn't extract line number from '$message'") if $message =~ /line \d+/;
16384ac1
KW
731 push @{$problems{$filename{$addr}}{$message}}, $opts;
732 #push @{$problems{$self->get_filename}{$message}}, $opts;
733 }
734
735 sub check_encoding { # Does it need an =encoding statement?
736 my ($self, $paragraph, $line_num, $pod_para) = @_;
737
738 # Do nothing if there is an =encoding in the file, or if the line
739 # doesn't require an =encoding, or have already warned.
740 my $addr = Scalar::Util::refaddr $self;
741 return if $seen_encoding_cmd{$addr}
742 || $warned_encoding{$addr}
743 || $paragraph !~ /\P{ASCII}/;
744
745 $warned_encoding{$addr} = 1;
746 my ($file, $line) = $pod_para->file_line;
747 $self->poderror({ -line => $line, -file => $file,
748 -msg => $need_encoding
749 });
750 return;
751 }
752
753 sub verbatim {
754 my ($self, $paragraph, $line_num, $pod_para) = @_;
755 $self->check_encoding($paragraph, $line_num, $pod_para);
756
757 $self->SUPER::verbatim($paragraph, $line_num, $pod_para);
758
2592f3b8
DG
759 my $addr = Scalar::Util::refaddr $self;
760
16384ac1
KW
761 # Pick up the name, since the parent class doesn't in verbatim
762 # NAMEs; so treat as non-verbatim. The parent class only allows one
763 # paragraph in a NAME section, so if there is an extra blank line, it
764 # will trigger a message, but such a blank line is harmless, so skip
765 # in that case.
2592f3b8 766 if ($in_NAME{$addr} && $paragraph =~ /\S/) {
16384ac1
KW
767 $self->textblock($paragraph, $line_num, $pod_para);
768 }
769
770 my @lines = split /^/, $paragraph;
771 for my $i (0 .. @lines - 1) {
2592f3b8
DG
772 if ( my $encoding = $seen_encoding_cmd{$addr} ) {
773 require Encode;
774 $lines[$i] = Encode::decode($encoding, $lines[$i]);
775 }
16384ac1
KW
776 $lines[$i] =~ s/\s+$//;
777 my $indent = $self->get_current_indent;
778 my $exceeds = length(Text::Tabs::expand($lines[$i]))
2cd46bfd 779 + $indent - $MAX_LINE_LENGTH;
16384ac1
KW
780 next unless $exceeds > 0;
781 my ($file, $line) = $pod_para->file_line;
782 $self->poderror({ -line => $line + $i, -file => $file,
783 -msg => $line_length,
784 parameter => "+$exceeds (including " . ($indent - $INDENT) . " from =over's)",
785 });
786 }
787 }
788
789 sub textblock {
790 my ($self, $paragraph, $line_num, $pod_para) = @_;
791 $self->check_encoding($paragraph, $line_num, $pod_para);
792
793 $self->SUPER::textblock($paragraph, $line_num, $pod_para);
794
795 my ($file, $line) = $pod_para->file_line;
796 my $addr = Scalar::Util::refaddr $self;
797 if ($in_NAME{$addr}) {
798 if (! $self->name) {
799 my $text = $self->interpolate($paragraph, $line_num);
800 if ($text =~ /^\s*(\S+?)\s*$/) {
801 $self->name($1);
802 $self->poderror({ -line => $line, -file => $file,
803 -msg => $missing_name_description,
804 parameter => $1});
805 }
806 }
807 }
808 $paragraph = join " ", split /^/, $paragraph;
809
810 # Matches something that looks like a file name, but is enclosed in
811 # C<...>
812 my $C_path_re = qr{ \b ( C<
a44ad494
KW
813 # exclude various things that have slashes
814 # in them but aren't paths
815 (?!
816 (?: (?: s | qr | m) / ) # regexes
817 | \d+/\d+> # probable fractions
818 | OS/2>
819 | Perl/Tk>
820 | origin/blead>
821 | origin/maint
822 | - # File names don't begin with "-"
823 )
824 [-\w]+ (?: / [-\w]+ )+ (?: \. \w+ )? > )
16384ac1
KW
825 }x;
826
827 # If looks like a reference to other documentation by containing the
828 # word 'See' and then a likely pod directive, warn.
4e9f7dd4
KW
829 while ($paragraph =~ m{
830 ( (?: \w+ \s+ )* ) # The phrase before, if any
831 \b [Ss]ee \s+
832 ( ( [^L] )
833 <
834 ( [^<]*? ) # The not < excludes nested C<L<...
835 >
836 )
2ebce8af 837 ( \s+ (?: under | in ) \s+ L< )?
4e9f7dd4
KW
838 }xg) {
839 my $prefix = $1 // "";
840 my $construct = $2; # The whole thing, like C<...>
841 my $type = $3;
842 my $interior = $4;
843 my $trailing = $5; # After the whole thing ending in "L<"
844
845 # If the full phrase is something like, "you might see C<", or
846 # similar, it really isn't a reference to a link. The ones I saw
847 # all had the word "you" in them; and the "you" wasn't the
848 # beginning of a sentence.
849 if ($prefix !~ / \b you \b /x) {
2ebce8af 850
f274fe9f
KW
851 # Now, find what the module or man page name within the
852 # construct would be if it actually has L<> syntax. If it
853 # doesn't have that syntax, will set the module to the entire
854 # interior.
855 $interior =~ m/ ^
856 (?: [^|]+ \| )? # Optional arbitrary text ending
857 # in "|"
858 ( .+? ) # module, etc. name
859 (?: \/ .+ )? # target within module
860 $
861 /xs;
862 my $module = $1;
863 if (! defined $trailing # not referring to something in another
864 # section
865 && $interior !~ /$non_pods/
866
867 # C<> that look like files have their own message below, so
868 # exclude them
869 && $construct !~ /$C_path_re/g
870
871 # There can't be spaces (I think) in module names or man
872 # pages
873 && $module !~ / \s /x
874
875 # F<> that end in eg \.pl are almost certainly ok, as are
876 # those that look like a path with multiple "/" chars
877 && ($type ne "F"
878 || (! -e $interior
879 && $interior !~ /\.\w+$/
880 && $interior !~ /\/.+\//)
881 )
882 ) {
883 $self->poderror({ -line => $line, -file => $file,
884 -msg => $see_not_linked,
885 parameter => $construct
886 });
887 }
4e9f7dd4 888 }
16384ac1
KW
889 }
890 while ($paragraph =~ m/$C_path_re/g) {
891 my $construct = $1;
892 $self->poderror({ -line => $line, -file => $file,
893 -msg => $C_with_slash,
894 parameter => $construct
895 });
896 }
897 return;
898 }
899
900 sub command {
901 my ($self, $cmd, $paragraph, $line_num, $pod_para) = @_;
902 my $addr = Scalar::Util::refaddr $self;
903 if ($cmd eq "pod") {
904 $seen_pod_cmd{$addr}++;
905 }
906 elsif ($cmd eq "encoding") {
907 my ($file, $line) = $pod_para->file_line;
2592f3b8 908 $seen_encoding_cmd{$addr} = $paragraph; # for later decoding
16384ac1
KW
909 if ($command_count{$addr} != 1 && $seen_pod_cmd{$addr}) {
910 $self->poderror({ -line => $line, -file => $file,
911 -msg => $encoding_first
912 });
913 }
914 }
915 $self->check_encoding($paragraph, $line_num, $pod_para);
916
917 # Pod::Check treats all =items as linkable, but the bullet and
918 # numbered lists really aren't. So keep our own list. This has to be
919 # processed before SUPER is called so that the list is started before
920 # the rest of it gets parsed.
921 if ($cmd eq 'item') { # Not linkable if item begins with * or a digit
922 $linkable_item{$addr} = ($paragraph !~ / ^ \s*
923 (?: [*]
924 | \d+ \.? (?: \$ | \s+ )
925 )/x)
926 ? 1
927 : 0;
a67b1afa 928
16384ac1
KW
929 }
930 $self->SUPER::command($cmd, $paragraph, $line_num, $pod_para);
931
932 $command_count{$addr}++;
933
934 $in_NAME{$addr} = 0; # Will change to 1 below if necessary
935 $in_begin{$addr} = 0; # ibid
936 if ($cmd eq 'over') {
937 my $text = $self->interpolate($paragraph, $line_num);
938 my $indent = 4; # default
939 $indent = $1 if $text && $text =~ /^\s*(\d+)\s*$/;
940 push @{$indents{$addr}}, $indent;
941 $current_indent{$addr} += $indent;
942 }
943 elsif ($cmd eq 'back') {
944 if (@{$indents{$addr}}) {
945 $current_indent{$addr} -= pop @{$indents{$addr}};
946 }
947 else {
948 # =back without corresponding =over, but should have
949 # warned already
950 $current_indent{$addr} = 0;
951 }
952 }
953 elsif ($cmd =~ /^head/) {
954 if (! $in_begin{$addr}) {
955
956 # If a particular formatter, then this command doesn't really
957 # apply
958 $current_indent{$addr} = 0;
959 undef @{$indents{$addr}};
960 }
961
962 my $text = $self->interpolate($paragraph, $line_num);
963 $in_NAME{$addr} = 1 if $cmd eq 'head1'
964 && $text && $text =~ /^NAME\b/;
965 }
966 elsif ($cmd eq 'begin') {
967 $in_begin{$addr} = 1;
968 }
969
970 return;
971 }
972
973 sub hyperlink {
974 my $self = shift;
975
02987562
KW
976 my $page;
977 if ($_[0] && ($page = $_[0][1]{'-page'})) {
978 my $node = $_[0][1]{'-node'};
979
980 # If the hyperlink is to an interior node of another page, save it
981 # so that we can see if we need to parse normally skipped files.
982 $has_referred_to_node{$page} = 1 if $node;
983
984 # Ignore certain placeholder links in perldelta. Check if the
985 # link is page-level, and also check if to a node within the page
986 if ($self->name && $self->name eq "perldelta"
987 && ((grep { $page eq $_ } @perldelta_ignore_links)
988 || ($node
989 && (grep { "$page/$node" eq $_ } @perldelta_ignore_links)
990 ))) {
991 return;
992 }
993 }
16384ac1
KW
994 return $self->SUPER::hyperlink($_[0]);
995 }
996
997 sub node {
998 my $self = shift;
999 my $text = $_[0];
1000 if($text) {
1001 $text =~ s/\s+$//s; # strip trailing whitespace
1002 $text =~ s/\s+/ /gs; # collapse whitespace
1003 my $addr = Scalar::Util::refaddr $self;
1004 push(@{$linkable_nodes{$addr}}, $text) if
1005 ! $current_indent{$addr}
1006 || $linkable_item{$addr};
1007 }
1008 return $self->SUPER::node($_[0]);
1009 }
1010
1011 sub get_current_indent {
1012 return $INDENT + $current_indent{Scalar::Util::refaddr $_[0]};
1013 }
1014
1015 sub get_filename {
1016 return $filename{Scalar::Util::refaddr $_[0]};
1017 }
1018
1019 sub linkable_nodes {
1020 my $linkables = $linkable_nodes{Scalar::Util::refaddr $_[0]};
1021 return undef unless $linkables;
1022 return @$linkables;
1023 }
1024
1025 sub get_skip {
1026 return $skip{Scalar::Util::refaddr $_[0]} // 0;
1027 }
1028
1029 sub set_skip {
1030 my $self = shift;
1031 $skip{Scalar::Util::refaddr $self} = shift;
1032
1033 # If skipping, no need to keep the problems for it
1034 delete $problems{$self->get_filename};
1035 return;
1036 }
1037}
1038
1039package Tie_Array_to_FH { # So printing actually goes to an array
1040
1041 my %array;
1042
1043 sub TIEHANDLE {
1044 my $class = shift;
1045 my $array_ref = shift;
1046
1047 my $self = bless \do{ my $anonymous_scalar }, $class;
1048 $array{Scalar::Util::refaddr $self} = $array_ref;
1049
1050 return $self;
1051 }
1052
1053 sub PRINT {
a67b1afa 1054 my $self = shift;
16384ac1
KW
1055 push @{$array{Scalar::Util::refaddr $self}}, @_;
1056 return 1;
1057 }
a67b1afa
MM
1058}
1059
69f6a9a1 1060
16384ac1
KW
1061my %filename_to_checker; # Map a filename to it's pod checker object
1062my %id_to_checker; # Map a checksum to it's pod checker object
1063my %nodes; # key is filename, values are nodes in that file.
1064my %nodes_first_word; # same, but value is first word of each node
1065my %valid_modules; # List of modules known to exist outside us.
1066my %digests; # checksums of files, whose names are the keys
1067my %filename_to_pod; # Map a filename to its pod NAME
1068my %files_with_unknown_issues;
1069my %files_with_fixes;
69f6a9a1 1070
16384ac1 1071my $data_fh;
be39e7f8 1072open $data_fh, '<:bytes', $known_issues or die "Can't open $known_issues";
69f6a9a1 1073
16384ac1
KW
1074my %counts; # For --counts param, count of each issue type
1075my %suppressed_files; # Files with at least one issue type to suppress
e57d740c
KW
1076my $HEADER = <<END;
1077# This file is the data file for $0.
1078# There are three types of lines.
1079# Comment lines are white-space only or begin with a '#', like this one. Any
1080# changes you make to the comment lines will be lost when the file is
1081# regen'd.
1082# Lines without tab characters are simply NAMES of pods that the program knows
1083# will have links to them and the program does not check if those links are
1084# valid.
1085# All other lines should have three fields, each separated by a tab. The
1086# first field is the name of a pod; the second field is an error message
1087# generated by this program; and the third field is a count of how many
1088# known instances of that message there are in the pod. -1 means that the
1089# program can expect any number of this type of message.
1090END
16384ac1 1091
e57d740c 1092my @existing_issues;
477100f8 1093
477100f8 1094
16384ac1 1095while (<$data_fh>) { # Read the data base
69f6a9a1 1096 chomp;
16384ac1
KW
1097 next if /^\s*(?:#|$)/; # Skip comment and empty lines
1098 if (/\t/) {
1099 next if $show_all;
e57d740c
KW
1100 if ($add_link) { # The issues are saved and later output unchanged
1101 push @existing_issues, $_;
1102 next;
1103 }
16384ac1
KW
1104
1105 # Keep track of counts of each issue type for each file
1106 my ($filename, $message, $count) = split /\t/;
1107 $known_problems{$filename}{$message} = $count;
1108
1109 if ($show_counts) {
1110 if ($count < 0) { # -1 means to suppress this issue type
1111 $suppressed_files{$filename} = $filename;
1112 }
1113 else {
1114 $counts{$message} += $count;
1115 }
1116 }
1117 }
1118 else { # Lines without a tab are modules known to be valid
1119 $valid_modules{$_} = 1
1120 }
1121}
1122close $data_fh;
1123
e57d740c
KW
1124if ($add_link) {
1125 $copy_fh = open_new($known_issues);
1126
1127 # Check for basic sanity, and add each command line argument
1128 foreach my $module (@files) {
1129 die "\"$module\" does not look like a module or man page"
1130 # Must look like (A or A::B or A::B::C ..., or foo(3C)
1131 if $module !~ /^ (?: \w+ (?: :: \w+ )* | \w+ \( \d \w* \) ) $/x;
1132 $valid_modules{$module} = 1
1133 }
1134 my_safer_print($copy_fh, $HEADER);
1135 foreach (sort { lc $a cmp lc $b } keys %valid_modules) {
1136 my_safer_print($copy_fh, $_, "\n");
1137 }
1138
1139 # The rest of the db file is output unchanged.
63717411 1140 my_safer_print($copy_fh, join "\n", @existing_issues, "");
e57d740c
KW
1141
1142 close_and_rename($copy_fh);
1143 exit;
1144}
1145
16384ac1
KW
1146if ($show_counts) {
1147 my $total = 0;
1148 foreach my $message (sort keys %counts) {
1149 $total += $counts{$message};
1150 note(Text::Tabs::expand("$counts{$message}\t$message"));
1151 }
1152 note("-----\n" . Text::Tabs::expand("$total\tknown potential issues"));
1153 if (%suppressed_files) {
1154 note("\nFiles that have all messages of at least one type suppressed:");
1155 note(join ",", keys %suppressed_files);
1156 }
1157 exit 0;
1158}
1159
16384ac1
KW
1160# re to match files that are to be parsed only if there is an internal link
1161# to them. It does not include cpan, as whether those are parsed depends
0496e0bb
KW
1162# on a switch. Currently, only perltoc and the stable perldelta.pod's
1163# are included. The latter all have characters between 'perl' and
1164# 'delta'. (Actually the currently developed one matches as well, but
1165# is a duplicate of perldelta.pod, so can be skipped, so fine for it to
1166# match this.
d9e2eb4b 1167my $only_for_interior_links_re = qr/ ^ pod\/perltoc.pod $
0496e0bb 1168 /x;
d9e2eb4b
KW
1169unless ($do_deltas) {
1170 $only_for_interior_links_re = qr/$only_for_interior_links_re |
1171 \b perl \d+ delta \. pod \b
1172 /x;
1173}
16384ac1
KW
1174
1175{ # Closure
1176 my $first_time = 1;
1177
1178 sub output_thanks ($$$$) { # Called when an issue has been fixed
1179 my $filename = shift;
1180 my $original_count = shift;
1181 my $current_count = shift;
1182 my $message = shift;
1183
1184 $files_with_fixes{$filename} = 1;
1185 my $return;
1186 my $fixed_count = $original_count - $current_count;
1187 my $a_problem = ($fixed_count == 1) ? "a problem" : "multiple problems";
1188 my $another_problem = ($fixed_count == 1) ? "another problem" : "another set of problems";
1189 my $diff;
1190 if ($message) {
1191 $diff = <<EOF;
1192There were $original_count occurrences (now $current_count) in this pod of type
1193"$message",
1194EOF
1195 } else {
1196 $diff = <<EOF;
1197There are no longer any problems found in this pod!
1198EOF
1199 }
1200
1201 if ($first_time) {
1202 $first_time = 0;
1203 $return = <<EOF;
1204Thanks for fixing $a_problem!
1205$diff
1206Now you must teach $0 that this was fixed.
1207EOF
1208 }
1209 else {
1210 $return = <<EOF
1211Thanks for fixing $another_problem.
1212$diff
1213EOF
1214 }
1215
1216 return $return;
1217 }
1218}
1219
1220sub my_safer_print { # print, with error checking for outputting to db
1221 my ($fh, @lines) = @_;
1222
1223 if (! print $fh @lines) {
1224 my $save_error = $!;
1225 close($fh);
1226 die "Write failure: $save_error";
1227 }
1228}
1229
1230sub extract_pod { # Extracts just the pod from a file
1231 my $filename = shift;
1232
1233 my @pod;
1234
1235 # Arrange for the output of Pod::Parser to be collected in an array we can
1236 # look at instead of being printed
1237 tie *ALREADY_FH, 'Tie_Array_to_FH', \@pod;
be39e7f8 1238 open my $in_fh, '<:bytes', $filename
763df156
KW
1239
1240 # The file should already have been opened once to get here, so if
77b8b9ad
KW
1241 # fails, just die. It's possible that a transitory file containing a
1242 # pod would get here, but not bothering to add code for that very
1243 # unlikely event.
16384ac1
KW
1244 or die "Can't open '$filename': $!\n";
1245
1246 my $parser = Pod::Parser->new();
1247 $parser->parse_from_filehandle($in_fh, *ALREADY_FH);
1248 close $in_fh;
1249
1250 return join "", @pod
1251}
1252
1253my $digest = Digest->new($digest_type);
1254
1255sub is_pod_file {
c1dcaaab
KW
1256 # If $_ is a pod file, add it to the lists and do other prep work.
1257
e42a86f0 1258 if (-d) {
16384ac1
KW
1259 # Don't look at files in directories that are for tests, nor those
1260 # beginning with a dot
1261 if ($_ eq 't' || $_ =~ /^\../) {
1262 $File::Find::prune = 1;
1263 }
1264 return;
1265 }
1266
e42a86f0
KW
1267 return unless -r && -s; # Can't check it if can't read it; no need to
1268 # check if 0 length
1269 return unless -f || -l; # Weird file types won't be pods
1270
de4da25d
CB
1271 if ($_ =~ /^\./ # No hidden Unix files
1272 || $_ =~ $non_pods) {
1273 note("Not considering $_") if DEBUG;
1274 return;
1275 }
1276
16384ac1
KW
1277 my $filename = $File::Find::name;
1278
3673acb0 1279 # $filename is relative, like './path'. Strip that initial part away.
16384ac1
KW
1280 # Assumes that the path separator is exactly one character.
1281 $filename =~ s/^\..//;
1282
b3fdb838 1283 return if $excluded_files{canonicalize($filename)};
16384ac1 1284
144a708b
NC
1285 my $contents = do {
1286 local $/;
763df156
KW
1287 my $candidate;
1288 if (! open $candidate, '<:bytes', $_) {
1289
77b8b9ad
KW
1290 # If a transitory file was found earlier, the open could fail
1291 # legitimately and we just skip the file; also skip it if it is a
1292 # broken symbolic link, as it is probably just a build problem;
1293 # certainly not a file that we would want to check the pod of.
1294 # Otherwise fail it here and no reason to process it further.
1295 # (But the test count will be off too)
1296 ok(0, "Can't open '$filename': $!")
e42a86f0 1297 if -r $filename && ! -l $filename;
763df156
KW
1298 return;
1299 }
144a708b
NC
1300 <$candidate>;
1301 };
16384ac1
KW
1302
1303 # If the file is a .pm or .pod, having any initial '=' on a line is
8505a15d
KW
1304 # grounds for testing it. Otherwise, require a head1 NAME line to
1305 # consider it as a potential pod
144a708b
NC
1306 if ($filename =~ /\.(?:pm|pod)/) {
1307 return unless $contents =~ /^=/m;
1308 } else {
1309 return unless $contents =~ /^=head1 +NAME/m;
16384ac1 1310 }
144a708b
NC
1311
1312 # Here, we know that the file is a pod. Add it to the list of files
1313 # to check and create a checker object for it.
1314
1315 push @files, $filename;
1316 my $checker = My::Pod::Checker->new($filename);
1317 $filename_to_checker{$filename} = $checker;
1318
1319 # In order to detect duplicate pods and only analyze them once, we
1320 # compute checksums for the file, so don't have to do an exact
1321 # compare. Note that if the pod is just part of the file, the
1322 # checksums can differ for the same pod. That special case is handled
1323 # later, since if the checksums of the whole file are the same, that
1324 # case won't even come up. We don't need the checksums for files that
1325 # we parse only if there is a link to its interior, but we do need its
1326 # NAME, which is also retrieved in the code below.
1327
1328 if ($filename =~ / (?: ^(cpan|lib|ext|dist)\/ )
1329 | $only_for_interior_links_re
1330 /x) {
1331 $digest->add($contents);
1332 $digests{$filename} = $digest->digest;
1333
1334 # lib files aren't analyzed if they are duplicates of files copied
1335 # there from some other directory. But to determine this, we need
1336 # to know their NAMEs. We might as well find the NAME now while
1337 # the file is open. Similarly, cpan files aren't analyzed unless
1338 # we're analyzing all of them, or this particular file is linked
1339 # to by a file we are analyzing, and thus we will want to verify
1340 # that the target exists in it. We need to know at least the NAME
1341 # to see if it's worth analyzing, or so we can determine if a lib
1342 # file is a copy of a cpan one.
1343 if ($filename =~ m{ (?: ^ (?: cpan | lib ) / )
16384ac1 1344 | $only_for_interior_links_re
144a708b
NC
1345 }x) {
1346 if ($contents =~ /^=head1 +NAME.*/mg) {
1347 # The NAME is the first non-spaces on the line up to a
1348 # comma, dash or end of line. Otherwise, it's invalid and
1349 # this pod doesn't have a legal name that we're smart
1350 # enough to find currently. But the parser will later
1351 # find it if it thinks there is a legal name, and set the
1352 # name
1353 if ($contents =~ /\G # continue from the line after =head1
1354 \s* # ignore any empty lines
1355 ^ \s* ( \S+?) \s* (?: [,-] | $ )/mx) {
1356 my $name = $1;
1357 $checker->name($name);
1358 $id_to_checker{$name} = $checker
1359 if $filename =~ m{^cpan/};
16384ac1
KW
1360 }
1361 }
144a708b
NC
1362 elsif ($filename =~ m{^cpan/}) {
1363 $id_to_checker{$digests{$filename}} = $checker;
1364 }
16384ac1
KW
1365 }
1366 }
c1dcaaab
KW
1367
1368 return;
16384ac1
KW
1369} # End of is_pod_file()
1370
477100f8
KW
1371# Start of real code that isn't processing the command line (except the
1372# db is read in above, as is processing of the --add_link option).
16384ac1
KW
1373# Here, @files contains list of files on the command line. If have any of
1374# these, unconditionally test them, and show all the errors, even the known
1375# ones, and, since not testing other pods, don't do cross-pod link tests.
1376# (Could add extra code to do cross-pod tests for the ones in the list.)
477100f8 1377
16384ac1
KW
1378if ($has_input_files) {
1379 undef %known_problems;
d9e2eb4b
KW
1380 $do_upstream_cpan = $do_deltas = 1; # In case one of the inputs is one
1381 # of these types
16384ac1
KW
1382}
1383else { # No input files -- go find all the possibilities.
1384 if ($regen) {
1385 $copy_fh = open_new($known_issues);
1386 note("Regenerating $known_issues, please be patient...");
e57d740c 1387 print $copy_fh $HEADER;
16384ac1
KW
1388 }
1389
1390 # Move to the directory above us, but have to adjust @INC to account for
1391 # that.
1392 s{^\.\./lib$}{lib} for @INC;
1393 chdir File::Spec->updir;
1394
1395 # And look in this directory and all its subdirectories
1396 find( \&is_pod_file, '.');
1397
1398 # Add ourselves to the test
b3fdb838 1399 push @files, "t/porting/podcheck.t";
16384ac1
KW
1400}
1401
1402# Now we know how many tests there will be.
1403plan (tests => scalar @files) if ! $regen;
1404
1405
1406 # Sort file names so we get consistent results, and to put cpan last,
1407 # preceeded by the ones that we don't generally parse. This is because both
1408 # these classes are generally parsed only if there is a link to the interior
1409 # of them, and we have to parse all others first to guarantee that they don't
1410 # have such a link. 'lib' files come just before these, as some of these are
1411 # duplicates of others. We already have figured this out when gathering the
1412 # data as a special case for all such files, but this, while unnecessary,
1413 # puts the derived file last in the output. 'readme' files come before those,
1414 # as those also could be duplicates of others, which are considered the
1415 # primary ones. These currently aren't figured out when gathering data, so
1416 # are done here.
1417 @files = sort { if ($a =~ /^cpan/) {
1418 return 1 if $b !~ /^cpan/;
e08998bf 1419 return lc $a cmp lc $b;
16384ac1
KW
1420 }
1421 elsif ($b =~ /^cpan/) {
1422 return -1;
1423 }
1424 elsif ($a =~ /$only_for_interior_links_re/) {
1425 return 1 if $b !~ /$only_for_interior_links_re/;
e08998bf 1426 return lc $a cmp lc $b;
16384ac1
KW
1427 }
1428 elsif ($b =~ /$only_for_interior_links_re/) {
1429 return -1;
1430 }
1431 elsif ($a =~ /^lib/) {
1432 return 1 if $b !~ /^lib/;
e08998bf 1433 return lc $a cmp lc $b;
16384ac1
KW
1434 }
1435 elsif ($b =~ /^lib/) {
1436 return -1;
1437 } elsif ($a =~ /\breadme\b/i) {
1438 return 1 if $b !~ /\breadme\b/i;
e08998bf 1439 return lc $a cmp lc $b;
16384ac1
KW
1440 }
1441 elsif ($b =~ /\breadme\b/i) {
1442 return -1;
1443 }
1444 else {
1445 return lc $a cmp lc $b;
1446 }
1447 }
1448 @files;
1449
1450# Now go through all the files and parse them
1451foreach my $filename (@files) {
1452 my $parsed = 0;
1453 note("parsing $filename") if DEBUG;
1454
1455 # We may have already figured out some things in the process of generating
8505a15d 1456 # the file list. If so, we have a $checker object already. But if not,
16384ac1
KW
1457 # generate one now.
1458 my $checker = $filename_to_checker{$filename};
1459 if (! $checker) {
1460 $checker = My::Pod::Checker->new($filename);
1461 $filename_to_checker{$filename} = $checker;
1462 }
1463
1464 # We have set the name in the checker object if there is a possibility
1465 # that no further parsing is necessary, but otherwise do the parsing now.
1466 if (! $checker->name) {
1467 $parsed = 1;
1468 $checker->parse_from_file($filename, undef);
1469 }
1470
1471 if ($checker->num_errors() < 0) { # Returns negative if not a pod
1472 $checker->set_skip("$filename is not a pod");
1473 }
1474 else {
1475
1476 # Here, is a pod. See if it is one that has already been tested,
1477 # or should be tested under another directory. Use either its NAME
1478 # if it has one, or a checksum if not.
1479 my $name = $checker->name;
1480 my $id;
1481
1482 if ($name) {
1483 $id = $name;
1484 }
1485 else {
1486 my $digest = Digest->new($digest_type);
1487 $digest->add(extract_pod($filename));
1488 $id = $digest->digest;
1489 }
1490
1491 # If there is a match for this pod with something that we've already
1492 # processed, don't process it, and output why.
1493 my $prior_checker;
1494 if (defined ($prior_checker = $id_to_checker{$id})
1495 && $prior_checker != $checker) # Could have defined the checker
1496 # earlier without pursuing it
1497 {
1498
1499 # If the pods are identical, then it's just a copy, and isn't an
1500 # error. First use the checksums we have already computed to see
1501 # if the entire files are identical, which means that the pods are
1502 # identical too.
1503 my $prior_filename = $prior_checker->get_filename;
1504 my $same = (! $name
1505 || ($digests{$prior_filename}
1506 && $digests{$filename}
1507 && $digests{$prior_filename} eq $digests{$filename}));
1508
1509 # If they differ, it could be that the files differ for some
1510 # reason, but the pods they contain are identical. Extract the
1511 # pods and do the comparisons on just those.
1512 if (! $same && $name) {
1513 $same = extract_pod($prior_filename) eq extract_pod($filename);
1514 }
1515
1516 if ($same) {
1517 $checker->set_skip("The pod of $filename is a duplicate of "
1518 . "the pod for $prior_filename");
1519 } elsif ($prior_filename =~ /\breadme\b/i) {
1520 $checker->set_skip("$prior_filename is a README apparently for $filename");
1521 } elsif ($filename =~ /\breadme\b/i) {
1522 $checker->set_skip("$filename is a README apparently for $prior_filename");
44b8cd40
KW
1523 } elsif (! $do_upstream_cpan
1524 && $filename =~ /^cpan/
1525 && $prior_filename =~ /^cpan/)
1526 {
ef906498 1527 $checker->set_skip("CPAN is upstream for $filename");
16384ac1
KW
1528 } else { # Here have two pods with identical names that differ
1529 $prior_checker->poderror(
1530 { -msg => $duplicate_name,
1531 -line => "???",
1532 parameter => "'$filename' also has NAME '$name'"
1533 });
1534 $checker->poderror(
1535 { -msg => $duplicate_name,
1536 -line => "???",
1537 parameter => "'$prior_filename' also has NAME '$name'"
1538 });
1539
1540 # Changing the names helps later.
1541 $prior_checker->name("$name version arbitrarily numbered 1");
1542 $checker->name("$name version arbitrarily numbered 2");
1543 }
1544
1545 # In any event, don't process this pod that has the same name as
1546 # another.
1547 next;
1548 }
1549
1550 # A unique pod.
1551 $id_to_checker{$id} = $checker;
1552
1553 my $parsed_for_links = ", but parsed for its interior links";
1554 if ((! $do_upstream_cpan && $filename =~ /^cpan/)
1555 || $filename =~ $only_for_interior_links_re)
1556 {
1557 if ($filename =~ /^cpan/) {
1558 $checker->set_skip("CPAN is upstream for $filename");
1559 }
d9e2eb4b 1560 elsif ($filename =~ /perl\d+delta/ && ! $do_deltas) {
16384ac1
KW
1561 $checker->set_skip("$filename is a stable perldelta");
1562 }
0496e0bb
KW
1563 elsif ($filename =~ /perltoc/) {
1564 $checker->set_skip("$filename dependent on component pods");
1565 }
16384ac1
KW
1566 else {
1567 croak("Unexpected file '$filename' encountered that has parsing for interior-linking only");
1568 }
1569
1570 if ($name && $has_referred_to_node{$name}) {
1571 $checker->set_skip($checker->get_skip() . $parsed_for_links);
1572 }
1573 }
1574
1575 # Need a name in order to process it, because not meaningful
1576 # otherwise, and also can't test links to this without a name.
1577 if (!defined $name) {
1578 $checker->poderror( { -msg => $no_name,
1579 -line => '???'
1580 });
1581 next;
1582 }
1583
1584 # For skipped files, just get its NAME
1585 my $skip;
1586 if (($skip = $checker->get_skip()) && $skip !~ /$parsed_for_links/)
1587 {
1588 $checker->node($name) if $name;
1589 }
1590 else {
1591 $checker->parse_from_file($filename, undef) if ! $parsed;
1592 }
1593
1594 # Go through everything in the file that could be an anchor that
1595 # could be a link target. Count how many there are of the same name.
1596 foreach my $node ($checker->linkable_nodes) {
1597 next if ! $node; # Can be empty is like '=item *'
1598 if (exists $nodes{$name}{$node}) {
1599 $nodes{$name}{$node}++;
1600 }
1601 else {
1602 $nodes{$name}{$node} = 1;
1603 }
1604
1605 # Experiments have shown that cpan search can figure out the
1606 # target of a link even if the exact wording is incorrect, as long
1607 # as the first word is. This happens frequently in perlfunc.pod,
1608 # where the link will be just to the function, but the target
1609 # entry also includes parameters to the function.
1610 my $first_word = $node;
1611 if ($first_word =~ s/^(\S+)\s+\S.*/$1/) {
1612 $nodes_first_word{$name}{$first_word} = $node;
1613 }
1614 }
1615 $filename_to_pod{$filename} = $name;
1616 }
1617}
1618
1619# Here, all files have been parsed, and all links and link targets are stored.
1620# Now go through the files again and see which don't have matches.
1621if (! $has_input_files) {
1622 foreach my $filename (@files) {
1623 next if $filename_to_checker{$filename}->get_skip;
1624 my $checker = $filename_to_checker{$filename};
1625 foreach my $link ($checker->hyperlink) {
1626 my $linked_to_page = $link->[1]->page;
1627 next unless $linked_to_page; # intra-file checks are handled by std
1628 # Pod::Checker
1629
1630 # Initialize the potential message.
1631 my %problem = ( -msg => $broken_link,
1632 -line => $link->[0],
1633 parameter => "to \"$linked_to_page\"",
1634 );
1635
1636 # See if we have found the linked-to_file in our parse
1637 if (exists $nodes{$linked_to_page}) {
1638 my $node = $link->[1]->node;
1639
1640 # If link is only to the page-level, already have it
1641 next if ! $node;
1642
1643 # Transform pod language to what we are expecting
1644 $node =~ s,E<sol>,/,g;
1645 $node =~ s/E<verbar>/|/g;
1646
1647 # If link is to a node that exists in the file, is ok
1648 if ($nodes{$linked_to_page}{$node}) {
1649
1650 # But if the page has multiple targets with the same name,
1651 # it's ambiguous which one this should be to.
1652 if ($nodes{$linked_to_page}{$node} > 1) {
1653 $problem{-msg} = $multiple_targets;
1654 $problem{parameter} = "in $linked_to_page that $node could be pointing to";
1655 $checker->poderror(\%problem);
1656 }
1657 } elsif (! $nodes_first_word{$linked_to_page}{$node}) {
1658
1659 # Here the link target was not found, either exactly or to
1660 # the first word. Is an error.
1661 $problem{parameter} =~ s,"$,/$node",;
1662 $checker->poderror(\%problem);
1663 }
1664
1665 } # Linked-to-file not in parse; maybe is in exception list
1666 elsif (! exists $valid_modules{$link->[1]->page}) {
1667
1668 # Here, is a link to a target that we can't find. Check if
1669 # there is an internal link on the page with the target name.
1670 # If so, it could be that they just forgot the initial '/'
1c01047d
KW
1671 # But perldelta is handled specially: only do this if the
1672 # broken link isn't one of the known bad ones (that are
1673 # placemarkers and should be removed for the final)
1674 my $NAME = $filename_to_pod{$filename};
1675 if (! defined $NAME) {
1676 $checker->poderror(\%problem);
1677 }
02987562 1678 else {
1c01047d
KW
1679 if ($nodes{$NAME}{$linked_to_page}) {
1680 $problem{-msg} = $broken_internal_link;
1681 }
1682 $checker->poderror(\%problem);
16384ac1 1683 }
16384ac1
KW
1684 }
1685 }
1686 }
1687}
1688
1689# If regenerating the data file, start with the modules for which we don't
de37f83f
KW
1690# check targets. If you change the sort order, you need to run --regen before
1691# committing so that future commits that do run regen don't show irrelevant
1692# changes.
16384ac1
KW
1693if ($regen) {
1694 foreach (sort { lc $a cmp lc $b } keys %valid_modules) {
1695 my_safer_print($copy_fh, $_, "\n");
1696 }
1697}
1698
1699# Now ready to output the messages.
1700foreach my $filename (@files) {
b3fdb838 1701 my $canonical = canonicalize($filename);
16384ac1
KW
1702 SKIP: {
1703 my $skip = $filename_to_checker{$filename}->get_skip // "";
1704
1705 if ($regen) {
1706 foreach my $message ( sort keys %{$problems{$filename}}) {
1707 my $count;
1708
1709 # Preserve a negative setting.
b3fdb838
KW
1710 if ($known_problems{$canonical}{$message}
1711 && $known_problems{$canonical}{$message} < 0)
16384ac1 1712 {
b3fdb838 1713 $count = $known_problems{$canonical}{$message};
16384ac1
KW
1714 }
1715 else {
1716 $count = @{$problems{$filename}{$message}};
1717 }
24df3027 1718 my_safer_print($copy_fh, $canonical . "\t$message\t$count\n");
16384ac1
KW
1719 }
1720 next;
1721 }
1722
1723 skip($skip, 1) if $skip;
1724 my @diagnostics;
1725 my $indent = ' ';
1726
16384ac1
KW
1727 my $total_known = 0;
1728 foreach my $message ( sort keys %{$problems{$filename}}) {
b3fdb838
KW
1729 $known_problems{$canonical}{$message} = 0
1730 if ! $known_problems{$canonical}{$message};
16384ac1
KW
1731 my $diagnostic = "";
1732 my $problem_count = scalar @{$problems{$filename}{$message}};
1733 $total_known += $problem_count;
b3fdb838
KW
1734 next if $known_problems{$canonical}{$message} < 0;
1735 if ($problem_count > $known_problems{$canonical}{$message}) {
16384ac1
KW
1736
1737 # Here we are about to output all the messages for this type,
1738 # subtract back this number we previously added in.
1739 $total_known -= $problem_count;
1740
1741 $diagnostic .= $indent . $message;
1742 if ($problem_count > 2) {
1743 $diagnostic .= " ($problem_count occurrences)";
1744 }
1745 foreach my $problem (@{$problems{$filename}{$message}}) {
1746 $diagnostic .= " " if $problem_count == 1;
1747 $diagnostic .= "\n$indent$indent";
1748 $diagnostic .= "$problem->{parameter}" if $problem->{parameter};
1749 $diagnostic .= " near line $problem->{-line}";
1750 $diagnostic .= " $problem->{comment}" if $problem->{comment};
1751 }
1752 $diagnostic .= "\n";
1753 $files_with_unknown_issues{$filename} = 1;
b3fdb838
KW
1754 } elsif ($problem_count < $known_problems{$canonical}{$message}) {
1755 $diagnostic = output_thanks($filename, $known_problems{$canonical}{$message}, $problem_count, $message);
16384ac1
KW
1756 }
1757 push @diagnostics, $diagnostic if $diagnostic;
1758 }
1759
09ea063a
KW
1760 # The above loop has output messages where there are current potential
1761 # issues. But it misses where there were some that have been entirely
1762 # fixed. For those, we need to look through the old issues
b3fdb838 1763 foreach my $message ( sort keys %{$known_problems{$canonical}}) {
09ea063a 1764 next if $problems{$filename}{$message};
b3fdb838
KW
1765 next if ! $known_problems{$canonical}{$message};
1766 next if $known_problems{$canonical}{$message} < 0; # Preserve negs
1767 my $diagnostic = output_thanks($filename, $known_problems{$canonical}{$message}, 0, $message);
09ea063a
KW
1768 push @diagnostics, $diagnostic if $diagnostic;
1769 }
1770
16384ac1
KW
1771 my $output = "POD of $filename";
1772 $output .= ", excluding $total_known not shown known potential problems"
1773 if $total_known;
1774 ok(@diagnostics == 0, $output);
1775 if (@diagnostics) {
1776 note(join "", @diagnostics,
477100f8 1777 "See end of this test output for your options on silencing this");
16384ac1
KW
1778 }
1779 }
1780}
1781
1782my $how_to = <<EOF;
1783 run this test script by hand, using the following formula (on
1784 Un*x-like machines):
1785 cd t
1786 ./perl -I../lib porting/podcheck.t --regen
1787EOF
1788
1789if (%files_with_unknown_issues) {
1790 my $were_count_files = scalar keys %files_with_unknown_issues;
1791 $were_count_files = ($were_count_files == 1)
1792 ? "was $were_count_files file"
1793 : "were $were_count_files files";
1794 my $message = <<EOF;
1795
1796HOW TO GET THIS .t TO PASS
1797
477100f8 1798There $were_count_files that had new potential problems identified.
bc20e6b8 1799Some of them may be real, and some of them may be false positives because
968d3762
KW
1800this program isn't as smart as it likes to think it is. You can teach this
1801program to ignore the issues it has identified, and hence pass, by doing the
477100f8 1802following:
16384ac1 1803
477100f8
KW
18041) If a problem is about a link to an unknown module or man page that
1805 you know exists, re-run the command something like:
1806 ./perl -I../lib porting/podcheck.t --add_link MODULE man_page ...
1807 (MODULEs should look like Foo::Bar, and man_pages should look like
1808 bar(3c); don't do this for a module or man page that you aren't sure
1809 about; instead treat as another type of issue and follow the
1810 instructions below.)
16384ac1
KW
1811
18122) For other issues, decide if each should be fixed now or not. Fix the
1813 ones you decided to, and rerun this test to verify that the fixes
1814 worked.
1815
968d3762
KW
18163) If there remain false positive or problems that you don't plan to fix right
1817 now,
16384ac1 1818$how_to
477100f8
KW
1819 That should cause all current potential problems to be accepted by
1820 the program, so that the next time it runs, they won't be flagged.
16384ac1
KW
1821EOF
1822 if (%files_with_fixes) {
1823 $message .= " This step will also take care of the files that have fixes in them\n";
1824 }
1825
1826 $message .= <<EOF;
1827 For a few files, such as perltoc, certain issues will always be
1828 expected, and more of the same will be added over time. For those,
1829 before you do the regen, you can edit
1830 $known_issues
1831 and find the entry for the module's file and specific error message,
1832 and change the count of known potential problems to -1.
1833EOF
1834
1835 note($message);
1836} elsif (%files_with_fixes) {
1837 note(<<EOF
1838To teach this test script that the potential problems have been fixed,
1839$how_to
1840EOF
1841 );
1842}
1843
1844if ($regen) {
1845 chdir $original_dir || die "Can't change directories to $original_dir";
1846 close_and_rename($copy_fh);
1847}