.TH LINKCHECKER 1 2001-03-10 "LinkChecker" "LinkChecker commandline usage"
.SH NAME
linkchecker - check HTML documents and websites for broken links
.
.SH SYNOPSIS
\fBlinkchecker\fP [\fIoptions\fP] [\fIfile-or-url\fP]...
.
.SH DESCRIPTION
.LP
LinkChecker features
recursive checking,
multithreading,
output in colored or normal text, HTML, SQL, CSV or a sitemap
graph in GML or XML,
support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet
and local file links,
restriction of link checking with regular expression filters for URLs,
proxy support,
username/password authorization for HTTP and FTP,
robots.txt exclusion protocol support,
i18n support,
a command line interface and
a (Fast)CGI web interface (requires HTTP server)
.
.SH EXAMPLES
The most common use checks the given domain recursively, plus any
URL pointing outside of the domain:
\fBlinkchecker http://treasure.calvinsplayground.de/\fP
.br
Beware that this checks the whole site which can have thousands of URLs.
Use the \fB\-r\fP option to restrict the recursion depth.
.br
Don't connect to \fBmailto:\fP hosts, only check their URL syntax. All other
links are checked as usual:
\fBlinkchecker \-\-ignore\-url=^mailto: www.mysite.org\fP
.br
Checking a local HTML file on Unix:
\fBlinkchecker ../bla.html\fP
.br
Checking from stdin:
\fBecho "bla.html" | linkchecker \-\-stdin\fP
.br
Checking a local HTML file on Windows:
\fBlinkchecker c:\\temp\\test.html\fP
.br
You can skip the \fBhttp://\fP URL part if the domain starts with \fBwww.\fP:
\fBlinkchecker www.myhomepage.de\fP
.br
You can skip the \fBftp://\fP URL part if the domain starts with \fBftp.\fP:
\fBlinkchecker \-r0 ftp.linux.org\fP
.br
Generate a sitemap graph and convert it with the graphviz dot utility:
\fBlinkchecker \-odot \-v www.myhomepage.de | dot \-Tps > sitemap.ps\fP
.
.SH OPTIONS
.SS General options
.TP
\fB\-h\fP, \fB\-\-help\fP
Help me! Print usage information for this program.
.TP
\fB\-f\fP\fIFILENAME\fP, \fB\-\-config=\fP\fIFILENAME\fP
Use \fIFILENAME\fP as the configuration file. By default, LinkChecker first
searches \fB/etc/linkchecker/linkcheckerrc\fP and then
\fB~/.linkchecker/linkcheckerrc\fP.
.TP
\fB\-I\fP, \fB\-\-interactive\fP
Ask for a URL if none is given on the command line.
.TP
\fB\-t\fP\fINUMBER\fP, \fB\-\-threads=\fP\fINUMBER\fP
Generate no more than the given number of threads. Default number
of threads is 10. To disable threading specify a non-positive number.
.TP
\fB\-\-priority\fP
Run with normal thread scheduling priority. By default, LinkChecker runs
with low thread priority to be suitable as a background job.
.TP
\fB\-V\fP, \fB\-\-version\fP
Print version and exit.
.TP
\fB\-\-allow\-root\fP
Do not drop privileges when running as root user on Unix systems.
.TP
\fB\-\-stdin\fP
Read a list of whitespace-separated URLs to check from stdin.
.
.SS Output options
.TP
\fB\-v\fP, \fB\-\-verbose\fP
Log all checked URLs once. Default is to log only errors and warnings.
.TP
\fB\-\-complete\fP
Log all URLs, including duplicates. Default is to log duplicate URLs only once.
.TP
\fB\-\-no\-warnings\fP
Don't log warnings. Default is to log warnings.
.TP
\fB\-W\fP\fIREGEX\fP, \fB\-\-warning\-regex=\fP\fIREGEX\fP
Define a regular expression which prints a warning if it matches any
content of the checked link.
This applies only to valid pages, since only their content can be retrieved.
.br
Use this to check for pages that contain some form of error, for example
"This page has moved" or "Oracle Application Server error".
.TP
\fB\-\-warning\-size\-bytes=\fP\fINUMBER\fP
Print a warning if content size info is available and exceeds the given
number of \fIbytes\fP.
.TP
\fB\-\-check\-html\fP
Check syntax of HTML URLs with local library (HTML tidy).
.TP
\fB\-\-check\-html\-w3\fP
Check syntax of HTML URLs with W3C online validator.
.TP
\fB\-\-check\-css\fP
Check syntax of CSS URLs with local library (cssutils).
.TP
\fB\-\-check\-css\-w3\fP
Check syntax of CSS URLs with W3C online validator.
.TP
\fB\-\-scan\-virus\fP
Scan content of URLs for viruses with ClamAV.
.TP
\fB\-q\fP, \fB\-\-quiet\fP
Quiet operation, an alias for \fB\-o none\fP.
This is only useful with \fB\-F\fP.
.TP
\fB\-o\fP\fITYPE\fP[\fB/\fP\fIENCODING\fP], \fB\-\-output=\fP\fITYPE\fP[\fB/\fP\fIENCODING\fP]
Specify output type as \fBtext\fP, \fBhtml\fP, \fBsql\fP,
\fBcsv\fP, \fBgml\fP, \fBdot\fP, \fBxml\fP, \fBnone\fP or \fBblacklist\fP.
Default type is \fBtext\fP. The various output types are documented
below.
.br
The \fIENCODING\fP specifies the output encoding; the default is
that of your locale. Valid encodings are listed at
\fBhttp://docs.python.org/lib/standard\-encodings.html\fP.
.TP
\fB\-F\fP\fITYPE\fP[\fB/\fP\fIENCODING\fP][\fB/\fP\fIFILENAME\fP], \fB\-\-file\-output=\fP\fITYPE\fP[\fB/\fP\fIENCODING\fP][\fB/\fP\fIFILENAME\fP]
Output to a file \fBlinkchecker\-out.\fP\fITYPE\fP,
\fB$HOME/.linkchecker/blacklist\fP for
\fBblacklist\fP output, or \fIFILENAME\fP if specified.
The \fIENCODING\fP specifies the output encoding; the default is
that of your locale.
Valid encodings are listed at
\fBhttp://docs.python.org/lib/standard\-encodings.html\fP.
The \fIFILENAME\fP and \fIENCODING\fP parts are ignored for the
\fBnone\fP output type. For all other types, an existing output file
is overwritten.
You can specify this option more than once. Valid file output types
are \fBtext\fP, \fBhtml\fP, \fBsql\fP,
\fBcsv\fP, \fBgml\fP, \fBdot\fP, \fBxml\fP, \fBnone\fP or \fBblacklist\fP.
Default is no file output. The various output types are documented
below. Note that you can suppress all console output
with the option \fB\-o none\fP.
.TP
\fB\-\-no\-status\fP
Do not print check status messages.
.TP
\fB\-D\fP\fISTRING\fP, \fB\-\-debug=\fP\fISTRING\fP
Print debugging output for the given logger.
Available loggers are \fBcmdline\fP, \fBchecking\fP,
\fBcache\fP, \fBgui\fP, \fBdns\fP and \fBall\fP.
Specifying \fBall\fP is an alias for specifying all available loggers.
The option can be given multiple times to debug with more
than one logger.
.br
For accurate results, threading will be disabled during debug runs.
.TP
\fB\-\-trace\fP
Print tracing information.
.TP
\fB\-\-profile\fP
Write profiling data into a file named \fBlinkchecker.prof\fP
in the current working directory. See also \fB\-\-viewprof\fP.
.TP
\fB\-\-viewprof\fP
Print out previously generated profiling data. See also
\fB\-\-profile\fP.
.
.SS Checking options
.TP
\fB\-r\fP\fINUMBER\fP, \fB\-\-recursion\-level=\fP\fINUMBER\fP
Recursively check all links up to the given depth.
A negative depth will enable infinite recursion.
Default depth is infinite.
.TP
\fB\-\-no\-follow\-url=\fP\fIREGEX\fP
Check but do not recurse into URLs matching the given regular
expression.
.br
This option can be given multiple times.
.TP
\fB\-\-ignore\-url=\fP\fIREGEX\fP
Only check syntax of URLs matching the given regular expression.
.br
This option can be given multiple times.
.TP
\fB\-C\fP, \fB\-\-cookies\fP
Accept and send HTTP cookies according to RFC 2109. Only cookies
which are sent back to the originating server are accepted.
Sent and accepted cookies are provided as additional logging
information.
.TP
\fB\-\-cookiefile=\fP\fIFILENAME\fP
Read a file with initial cookie data. The cookie data
format is explained below.
.TP
\fB\-a\fP, \fB\-\-anchors\fP
Check HTTP anchor references. Default is not to check anchors.
This option enables logging of the warning \fBurl\-anchor\-not\-found\fP.
.TP
\fB\-\-no\-anchor\-caching\fP
Treat url#anchora and url#anchorb as equal on caching. This
is the default browser behaviour, but it's not specified in
the URI specification. Use with care since broken anchors are not
guaranteed to be detected in this mode.
.TP
\fB\-u\fP\fISTRING\fP, \fB\-\-user=\fP\fISTRING\fP
Try the given username for HTTP and FTP authorization.
For FTP the default username is \fBanonymous\fP. For HTTP there is
no default username. See also \fB\-p\fP.
.TP
\fB\-p\fP\fISTRING\fP, \fB\-\-password=\fP\fISTRING\fP
Try the given password for HTTP and FTP authorization.
For FTP the default password is \fBanonymous@\fP. For HTTP there is
no default password. See also \fB\-u\fP.
.TP
\fB\-\-timeout=\fP\fINUMBER\fP
Set the timeout for connection attempts in seconds. The default timeout
is 60 seconds.
.TP
\fB\-P\fP\fINUMBER\fP, \fB\-\-pause=\fP\fINUMBER\fP
Pause the given number of seconds between two subsequent connection
requests to the same host. Default is no pause between requests.
.TP
\fB\-N\fP\fISTRING\fP, \fB\-\-nntp\-server=\fP\fISTRING\fP
Specify an NNTP server for \fBnews:\fP links. Default is the
environment variable \fBNNTP_SERVER\fP. If no host is given,
only the syntax of the link is checked.
.TP
\fB\-\-no\-proxy\-for=\fP\fIREGEX\fP
Contact hosts that match the given regular expression directly instead of
going through a proxy.
.br
This option can be given multiple times.
.SH "CONFIGURATION FILES"
Configuration files can specify all options above. They can also
specify some options that cannot be set on the command line.
See \fBlinkcheckerrc\fP(5) for more info.
.SH OUTPUT TYPES
Note that by default only errors and warnings are logged.
You should use the \fB\-\-verbose\fP option to get the complete URL list,
especially when outputting a sitemap graph format.
.TP
\fBtext\fP
Standard text logger, logging URLs in keyword: argument fashion.
.TP
\fBhtml\fP
Log URLs in keyword: argument fashion, formatted as HTML.
Additionally has links to the referenced pages. Invalid URLs have
HTML and CSS syntax check links appended.
.TP
\fBcsv\fP
Log check result in CSV format with one URL per line.
.TP
\fBgml\fP
Log parent-child relations between linked URLs as a GML sitemap graph.
.TP
\fBdot\fP
Log parent-child relations between linked URLs as a DOT sitemap graph.
.TP
\fBgxml\fP
Log check result as a GraphXML sitemap graph.
.TP
\fBxml\fP
Log check result as machine-readable XML.
.TP
\fBsql\fP
Log check result as SQL script with INSERT commands. An example
script to create the initial SQL table is included as create.sql.
.TP
\fBblacklist\fP
Suitable for cron jobs. Logs the check result into a file
\fB~/.linkchecker/blacklist\fP which only contains entries with invalid
URLs and the number of times they have failed.
.TP
\fBnone\fP
Logs nothing. Suitable for debugging or checking the exit code.
.
.SH REGULAR EXPRESSIONS
Only Python regular expressions are accepted by LinkChecker.
See \fBhttp://www.amk.ca/python/howto/regex/\fP for an introduction to
regular expressions.
The only addition is that a leading exclamation mark negates the regular
expression.
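.PP
As a rough illustration of the negation rule above, the following Python
sketch strips a leading exclamation mark and inverts the match result.
The helper names \fBsplit_negation\fP and \fBurl_matches\fP are
hypothetical, and the use of \fBre.search\fP (match anywhere in the URL)
is an assumption; LinkChecker's own implementation may differ.
.PP
.nf
```python
import re


def split_negation(pattern):
    """Return (regex_text, negate) for a pattern with an optional leading '!'."""
    if pattern.startswith("!"):
        return pattern[1:], True
    return pattern, False


def url_matches(pattern, url):
    """Apply the pattern to the URL, inverting the result if it was negated."""
    regex_text, negate = split_negation(pattern)
    # Assumption: the pattern may match anywhere in the URL (re.search).
    found = re.search(regex_text, url) is not None
    return found != negate
```
.fi
.PP
Under these assumptions, \fB!^mailto:\fP selects every URL that does not
start with \fBmailto:\fP.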
.
.SH COOKIE FILES
A cookie file contains standard RFC 822 header data with the following
possible names:
.
.TP
\fBScheme\fP (optional)
Sets the scheme the cookies are valid for; default scheme is \fBhttp\fP.
.TP
\fBHost\fP (required)
Sets the domain the cookies are valid for.
.TP
\fBPath\fP (optional)
Gives the path the cookies are valid for; default path is \fB/\fP.
.TP
\fBSet-cookie\fP (optional)
Set cookie name/value. Can be given more than once.
.PP
Multiple entries are separated by a blank line.
.
.PP
The example below will send two cookies to all URLs starting with
\fBhttp://example.com/hello/\fP and one to all URLs starting
with \fBhttps://example.org/\fP:
.PP
.nf
Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"
.sp
Scheme: https
Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
.fi
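.PP
To make the entry structure concrete, here is a minimal Python sketch that
splits cookie file text into entries at blank lines and applies the
documented defaults (scheme \fBhttp\fP, path \fB/\fP). The function name
\fBparse_cookie_entries\fP and the dictionary layout are hypothetical; this
is not LinkChecker's own parser.
.PP
.nf
```python
COOKIE_DATA = """Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"

Scheme: https
Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
"""


def parse_cookie_entries(text):
    """Split cookie file text into a list of entry dictionaries."""
    entries = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current entry.
            current = None
            continue
        if current is None:
            # Defaults as documented: scheme "http" and path "/".
            current = {"scheme": "http", "path": "/", "cookies": []}
            entries.append(current)
        name, _, value = line.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if name == "set-cookie":
            # Set-cookie may be given more than once per entry.
            current["cookies"].append(value)
        else:
            current[name] = value
    return entries
```
.fi
.PP
Calling \fBparse_cookie_entries(COOKIE_DATA)\fP yields two entries, the
first with two cookies and the second with one.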
.SH PROXY SUPPORT
To use a proxy on Unix or Windows set $http_proxy, $https_proxy or $ftp_proxy
to the proxy URL. The URL should be of the form
\fBhttp://\fP[\fIuser\fP\fB:\fP\fIpass\fP\fB@\fP]\fIhost\fP[\fB:\fP\fIport\fP].
LinkChecker also detects manual proxy settings of Internet Explorer under
Windows systems. On a Mac use the Internet Config to select a proxy.
.
.PP
Setting an HTTP proxy on Unix, for example, looks like this:
.br
\fBexport http_proxy="http://proxy.example.com:8080"\fP
.PP
Proxy authentication is also supported:
.br
\fBexport http_proxy="http://user1:mypass@proxy.example.org:8081"\fP
.PP
Setting a proxy on the Windows command prompt:
.br
\fBset http_proxy=http://proxy.example.com:8080\fP
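.PP
The parts of a proxy URL of the form described above can be inspected with
the Python standard library. The sketch below is only an illustration; the
function names \fBsplit_proxy_url\fP and \fBproxy_from_env\fP are
hypothetical and do not reflect how LinkChecker reads its proxy settings.
.PP
.nf
```python
import os
from urllib.parse import urlsplit


def split_proxy_url(url):
    """Return the user, password, host and port parts of a proxy URL."""
    parts = urlsplit(url)
    return {
        "user": parts.username,      # None if no credentials are given
        "password": parts.password,  # None if no credentials are given
        "host": parts.hostname,
        "port": parts.port,          # integer, or None if omitted
    }


def proxy_from_env(name="http_proxy"):
    """Read a proxy URL from the environment; None if the variable is unset."""
    url = os.environ.get(name)
    return split_proxy_url(url) if url else None
```
.fi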
.SH NOTES
URLs on the command line starting with \fBftp.\fP are treated like
\fBftp://ftp.\fP, URLs starting with \fBwww.\fP are treated like
\fBhttp://www.\fP.
You can also give local files as arguments.
.PP
If you have your system configured to automatically establish a
connection to the internet (e.g. with diald), it will connect when
checking links not pointing to your local host.
Use the \fB\-\-ignore\-url\fP option to prevent this.
.PP
JavaScript links are currently ignored.
.PP
If your platform does not support threading, LinkChecker disables it
automatically.
.PP
You can supply multiple user/password pairs in a configuration file.
.PP
When checking \fBnews:\fP links the given NNTP host doesn't need to be the
same as the host of the user browsing your pages.
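.PP
The prefix rule for \fBwww.\fP and \fBftp.\fP arguments described in this
section can be sketched in Python as follows. The function name
\fBexpand_shorthand\fP is hypothetical, and the sketch covers only the two
prefixes named here; it leaves every other argument unchanged.
.PP
.nf
```python
def expand_shorthand(argument):
    """Add the implied scheme to arguments starting with 'www.' or 'ftp.'."""
    if argument.startswith("www."):
        return "http://" + argument
    if argument.startswith("ftp."):
        return "ftp://" + argument
    # Anything else (full URLs, local file names) is returned as given.
    return argument
```
.fi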
.
.SH ENVIRONMENT
\fBNNTP_SERVER\fP - specifies default NNTP server
.br
\fBhttp_proxy\fP - specifies default HTTP proxy server
.br
\fBftp_proxy\fP - specifies default FTP proxy server
.br
\fBLC_MESSAGES\fP, \fBLANG\fP, \fBLANGUAGE\fP - specify output language
.
.SH RETURN VALUE
The return value is non-zero when
.IP \(bu
invalid links were found or
.IP \(bu
link warnings were found and warnings are enabled or
.IP \(bu
a program error occurred.
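.PP
A calling script can act on the return value. The following Python sketch
runs a command and treats a zero exit status as success; the function names
\fBlinks_ok\fP and \fBreport\fP are hypothetical. A non-zero status does
not say which of the conditions listed above applied.
.PP
.nf
```python
import subprocess
import sys


def links_ok(command):
    """Run a checker command; True means exit status zero (no problems)."""
    completed = subprocess.run(command)
    return completed.returncode == 0


def report(command):
    """Print a one-line summary to stderr and return the result."""
    ok = links_ok(command)
    print("links ok" if ok else "problems found", file=sys.stderr)
    return ok
```
.fi
.PP
For example, a cron job could call
\fBreport(["linkchecker", "\-o", "none", "www.example.org"])\fP,
where the host name is a placeholder.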
.
.SH LIMITATIONS
LinkChecker consumes memory for each queued URL to check. With thousands
of queued URLs the amount of consumed memory can become quite large. This
might slow down the program or even the whole system.
.
.SH FILES
\fB/etc/linkchecker/linkcheckerrc\fP, \fB~/.linkchecker/linkcheckerrc\fP - default
configuration files
.br
\fB~/.linkchecker/blacklist\fP - default blacklist logger output filename
.br
\fBlinkchecker\-out.\fP\fITYPE\fP - default logger file output name
.br
\fBhttp://docs.python.org/lib/standard\-encodings.html\fP - valid output encodings
.br
\fBhttp://www.amk.ca/python/howto/regex/\fP - regular expression documentation
.SH "SEE ALSO"
\fBlinkcheckerrc\fP(5)
.
.SH AUTHOR
Bastian Kleineidam <calvin@users.sourceforge.net>
.
.SH COPYRIGHT
Copyright \(co 2000-2009 Bastian Kleineidam