mirror of
https://github.com/Hopiu/linkchecker.git
synced 2026-03-31 13:10:37 +00:00
492 lines
16 KiB
Groff
492 lines
16 KiB
Groff
.TH LINKCHECKER 1 2010-07-01 "LinkChecker" "LinkChecker commandline usage"
|
|
.SH NAME
|
|
linkchecker - command line client to check HTML documents and websites for broken links
|
|
.
|
|
.SH SYNOPSIS
|
|
\fBlinkchecker\fP [\fIoptions\fP] [\fIfile-or-url\fP]...
|
|
.
|
|
.SH DESCRIPTION
|
|
.LP
|
|
LinkChecker features
|
|
.IP \(bu
|
|
recursive and multithreaded checking,
|
|
.IP \(bu
|
|
output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph in different formats,
|
|
.IP \(bu
|
|
support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file links,
|
|
.IP \(bu
|
|
restriction of link checking with URL filters,
|
|
.IP \(bu
|
|
proxy support,
|
|
.IP \(bu
|
|
username/password authorization for HTTP, FTP and Telnet,
|
|
.IP \(bu
|
|
support for robots.txt exclusion protocol,
|
|
.IP \(bu
|
|
support for Cookies
|
|
.IP \(bu
|
|
support for HTML5
|
|
.IP \(bu
|
|
HTML and CSS syntax check
|
|
.IP \(bu
|
|
Antivirus check
|
|
.IP \(bu
|
|
a command line, GUI and web interface
|
|
.SH EXAMPLES
|
|
The most common use checks the given domain recursively:
|
|
\fBlinkchecker http://www.example.com/\fP
|
|
.br
|
|
Beware that this checks the whole site which can have thousands of URLs.
|
|
Use the \fB\-r\fP option to restrict the recursion depth.
|
|
.br
|
|
Don't check URLs with \fB/secret\fP in its name. All other links are checked as usual:
|
|
\fBlinkchecker \-\-ignore\-url=/secret mysite.example.com\fP
|
|
.br
|
|
Checking a local HTML file on Unix:
|
|
\fBlinkchecker ../bla.html\fP
|
|
.br
|
|
Checking a local HTML file on Windows:
|
|
\fBlinkchecker c:\\temp\\test.html\fP
|
|
.br
|
|
You can skip the \fBhttp://\fP url part if the domain starts with \fBwww.\fP:
|
|
\fBlinkchecker www.example.com\fP
|
|
.br
|
|
You can skip the \fBftp://\fP url part if the domain starts with \fBftp.\fP:
|
|
\fBlinkchecker \-r0 ftp.example.com\fP
|
|
.br
|
|
Generate a sitemap graph and convert it with the graphviz dot utility:
|
|
\fBlinkchecker \-odot \-v www.example.com | dot \-Tps > sitemap.ps\fP
|
|
.
|
|
.SH OPTIONS
|
|
.SS General options
|
|
.TP
|
|
\fB\-f\fP\fIFILENAME\fP, \fB\-\-config=\fP\fIFILENAME\fP
|
|
Use \fIFILENAME\fP as configuration file. As default LinkChecker
|
|
uses \fB~/.linkchecker/linkcheckerrc\fP.
|
|
.TP
|
|
\fB\-h\fP, \fB\-\-help\fP
|
|
Help me! Print usage information for this program.
|
|
.TP
|
|
\fB\-\-stdin\fP
|
|
Read list of white-space separated URLs to check from stdin.
|
|
.TP
|
|
\fB\-t\fP\fINUMBER\fP, \fB\-\-threads=\fP\fINUMBER\fP
|
|
Generate no more than the given number of threads. Default number
|
|
of threads is 100. To disable threading specify a non-positive number.
|
|
.TP
|
|
\fB\-V\fP, \fB\-\-version\fP
|
|
Print version and exit.
|
|
.TP
|
|
\fB\-\-list\-plugins\fP
|
|
Print available check plugins and exit.
|
|
.
|
|
.SS Output options
|
|
.TP
|
|
\fB\-D\fP\fISTRING\fP, \fB\-\-debug=\fP\fISTRING\fP
|
|
Print debugging output for the given logger.
|
|
Available loggers are \fBcmdline\fP, \fBchecking\fP,
|
|
\fBcache\fP, \fBgui\fP, \fBdns\fP and \fBall\fP.
|
|
Specifying \fBall\fP is an alias for specifying all available loggers.
|
|
The option can be given multiple times to debug with more
|
|
than one logger.
|
|
.BR
|
|
For accurate results, threading will be disabled during debug runs.
|
|
.TP
|
|
\fB\-F\fP\fITYPE\fP[\fB/\fP\fIENCODING\fP][\fB/\fP\fIFILENAME\fP], \fB\-\-file\-output=\fP\fITYPE\fP[\fB/\fP\fIENCODING\fP][\fB/\fP\fIFILENAME\fP]
|
|
Output to a file \fBlinkchecker\-out.\fP\fITYPE\fP,
|
|
\fB$HOME/.linkchecker/blacklist\fP for
|
|
\fBblacklist\fP output, or \fIFILENAME\fP if specified.
|
|
The \fIENCODING\fP specifies the output encoding, the default is
|
|
that of your locale.
|
|
Valid encodings are listed at
|
|
\fBhttp://docs.python.org/library/\:codecs.html#standard-encodings\fP.
|
|
.br
|
|
The \fIFILENAME\fP and \fIENCODING\fP parts of the \fBnone\fP output type
|
|
will be ignored, else if the file already exists, it will be overwritten.
|
|
You can specify this option more than once. Valid file output types
|
|
are \fBtext\fP, \fBhtml\fP, \fBsql\fP,
|
|
\fBcsv\fP, \fBgml\fP, \fBdot\fP, \fBxml\fP, \fBsitemap\fP, \fBnone\fP or
|
|
\fBblacklist\fP.
|
|
Default is no file output. The various output types are documented
|
|
below. Note that you can suppress all console output
|
|
with the option \fB\-o none\fP.
|
|
.TP
|
|
\fB\-\-no\-status\fP
|
|
Do not print check status messages.
|
|
.TP
|
|
\fB\-\-no\-warnings\fP
|
|
Don't log warnings. Default is to log warnings.
|
|
.TP
|
|
\fB\-o\fP\fITYPE\fP[\fB/\fP\fIENCODING\fP], \fB\-\-output=\fP\fITYPE\fP[\fB/\fP\fIENCODING\fP]
|
|
Specify output type as \fBtext\fP, \fBhtml\fP, \fBsql\fP,
|
|
\fBcsv\fP, \fBgml\fP, \fBdot\fP, \fBxml\fP, \fBsitemap\fP, \fBnone\fP or
|
|
\fBblacklist\fP.
|
|
Default type is \fBtext\fP. The various output types are documented
|
|
below.
|
|
.br
|
|
The \fIENCODING\fP specifies the output encoding, the default is
|
|
that of your locale. Valid encodings are listed at
|
|
\fBhttp://docs.python.org/library/\:codecs.html#standard-encodings\fP.
|
|
.TP
|
|
\fB\-q\fP, \fB\-\-quiet\fP
|
|
Quiet operation, an alias for \fB\-o none\fP.
|
|
This is only useful with \fB\-F\fP.
|
|
.TP
|
|
\fB\-v\fP, \fB\-\-verbose\fP
|
|
Log all checked URLs. Default is to log only errors and warnings.
|
|
.TP
|
|
\fB\-W\fP\fIREGEX\fP, \fB\-\-warning\-regex=\fIREGEX\fP
|
|
Define a regular expression which prints a warning if it matches any
|
|
content of the checked link.
|
|
This applies only to valid pages, so we can get their content.
|
|
.br
|
|
Use this to check for pages that contain some form of error, for example
|
|
"This page has moved" or "Oracle Application error".
|
|
.br
|
|
Note that multiple values can be combined in the regular expression,
|
|
for example "(This page has moved|Oracle Application error)".
|
|
.br
|
|
See section \fBREGULAR EXPRESSIONS\fP for more info.
|
|
.SS Checking options
|
|
.TP
|
|
\fB\-\-cookiefile=\fP\fIFILENAME\fP
|
|
Read a file with initial cookie data. The cookie data
|
|
format is explained below.
|
|
.TP
|
|
\fB\-\-check\-extern
|
|
Check external URLs.
|
|
.TP
|
|
\fB\-\-ignore\-url=\fP\fIREGEX\fP
|
|
URLs matching the given regular expression will be ignored and not checked.
|
|
.br
|
|
This option can be given multiple times.
|
|
.br
|
|
See section \fBREGULAR EXPRESSIONS\fP for more info.
|
|
.TP
|
|
\fB\-N\fP\fISTRING\fP, \fB\-\-nntp\-server=\fP\fISTRING\fP
|
|
Specify an NNTP server for \fBnews:\fP links. Default is the
|
|
environment variable \fBNNTP_SERVER\fP. If no host is given,
|
|
only the syntax of the link is checked.
|
|
.TP
|
|
\fB\-\-no\-follow\-url=\fP\fIREGEX\fP
|
|
Check but do not recurse into URLs matching the given regular
|
|
expression.
|
|
.br
|
|
This option can be given multiple times.
|
|
.br
|
|
See section \fBREGULAR EXPRESSIONS\fP for more info.
|
|
.TP
|
|
\fB\-p\fP, \fB\-\-password\fP
|
|
Read a password from console and use it for HTTP and FTP authorization.
|
|
For FTP the default password is \fBanonymous@\fP. For HTTP there is
|
|
no default password. See also \fB\-u\fP.
|
|
.TP
|
|
\fB\-r\fP\fINUMBER\fP, \fB\-\-recursion\-level=\fP\fINUMBER\fP
|
|
Check recursively all links up to given depth.
|
|
A negative depth will enable infinite recursion.
|
|
Default depth is infinite.
|
|
.TP
|
|
\fB\-\-timeout=\fP\fINUMBER\fP
|
|
Set the timeout for connection attempts in seconds. The default timeout
|
|
is 60 seconds.
|
|
.TP
|
|
\fB\-u\fP\fISTRING\fP, \fB\-\-user=\fP\fISTRING\fP
|
|
Try the given username for HTTP and FTP authorization.
|
|
For FTP the default username is \fBanonymous\fP. For HTTP there is
|
|
no default username. See also \fB\-p\fP.
|
|
.TP
|
|
\fB\-\-user\-agent=\fP\fISTRING\fP
|
|
Specify the User-Agent string to send to the HTTP server, for example
|
|
"Mozilla/4.0". The default is "LinkChecker/X.Y" where X.Y is the current
|
|
version of LinkChecker.
|
|
|
|
.SH "CONFIGURATION FILES"
|
|
Configuration files can specify all options above. They can also
|
|
specify some options that cannot be set on the command line.
|
|
See \fBlinkcheckerrc\fP(5) for more info.
|
|
|
|
.SH OUTPUT TYPES
|
|
Note that by default only errors and warnings are logged.
|
|
You should use the \fB\-\-verbose\fP option to get the complete URL list,
|
|
especially when outputting a sitemap graph format.
|
|
|
|
.TP
|
|
\fBtext\fP
|
|
Standard text logger, logging URLs in keyword: argument fashion.
|
|
.TP
|
|
\fBhtml\fP
|
|
Log URLs in keyword: argument fashion, formatted as HTML.
|
|
Additionally has links to the referenced pages. Invalid URLs have
|
|
HTML and CSS syntax check links appended.
|
|
.TP
|
|
\fBcsv\fP
|
|
Log check result in CSV format with one URL per line.
|
|
.TP
|
|
\fBgml\fP
|
|
Log parent-child relations between linked URLs as a GML sitemap graph.
|
|
.TP
|
|
\fBdot\fP
|
|
Log parent-child relations between linked URLs as a DOT sitemap graph.
|
|
.TP
|
|
\fBgxml\fP
|
|
Log check result as a GraphXML sitemap graph.
|
|
.TP
|
|
\fBxml\fP
|
|
Log check result as machine-readable XML.
|
|
.TP
|
|
\fBsitemap\fP
|
|
Log check result as an XML sitemap whose protocol is documented at
|
|
\fBhttp://www.sitemaps.org/protocol.html\fP.
|
|
.TP
|
|
\fBsql\fP
|
|
Log check result as SQL script with INSERT commands. An example
|
|
script to create the initial SQL table is included as create.sql.
|
|
.TP
|
|
\fBblacklist\fP
|
|
Suitable for cron jobs. Logs the check result into a file
|
|
\fB~/.linkchecker/blacklist\fP which only contains entries with invalid
|
|
URLs and the number of times they have failed.
|
|
.TP
|
|
\fBnone\fP
|
|
Logs nothing. Suitable for debugging or checking the exit code.
|
|
.
|
|
.SH REGULAR EXPRESSIONS
|
|
LinkChecker accepts Python regular expressions.
|
|
See \fBhttp://docs.python.org/\:howto/regex.html\fP for an introduction.
|
|
|
|
An addition is that a leading exclamation mark negates the regular
|
|
expression.
|
|
.
|
|
.SH COOKIE FILES
|
|
A cookie file contains standard HTTP header (RFC 2616) data with the
|
|
following possible names:
|
|
.
|
|
.TP
|
|
\fBHost\fP (required)
|
|
Sets the domain the cookies are valid for.
|
|
.TP
|
|
\fBPath\fP (optional)
|
|
Gives the path the cookies are value for; default path is \fB/\fP.
|
|
.TP
|
|
\fBSet-cookie\fP (required)
|
|
Set cookie name/value. Can be given more than once.
|
|
.PP
|
|
Multiple entries are separated by a blank line.
|
|
.
|
|
The example below will send two cookies to all URLs starting with
|
|
\fBhttp://example.com/hello/\fP and one to all URLs starting
|
|
with \fBhttps://example.org/\fP:
|
|
|
|
Host: example.com
|
|
Path: /hello
|
|
Set-cookie: ID="smee"
|
|
Set-cookie: spam="egg"
|
|
|
|
Host: example.org
|
|
Set-cookie: baggage="elitist"; comment="hologram"
|
|
|
|
.SH PROXY SUPPORT
|
|
To use a proxy on Unix or Windows set the $http_proxy, $https_proxy or $ftp_proxy
|
|
environment variables to the proxy URL. The URL should be of the form
|
|
\fBhttp://\fP[\fIuser\fP\fB:\fP\fIpass\fP\fB@\fP]\fIhost\fP[\fB:\fP\fIport\fP].
|
|
LinkChecker also detects manual proxy settings of Internet Explorer under
|
|
Windows systems, and gconf or KDE on Linux systems.
|
|
On a Mac use the Internet Config to select a proxy.
|
|
.
|
|
You can also set a comma-separated domain list in the $no_proxy environment
|
|
variables to ignore any proxy settings for these domains.
|
|
.
|
|
Setting a HTTP proxy on Unix for example looks like this:
|
|
|
|
export http_proxy="http://proxy.example.com:8080"
|
|
|
|
Proxy authentication is also supported:
|
|
|
|
export http_proxy="http://user1:mypass@proxy.example.org:8081"
|
|
|
|
Setting a proxy on the Windows command prompt:
|
|
|
|
set http_proxy=http://proxy.example.com:8080
|
|
|
|
.SH PERFORMED CHECKS
|
|
All URLs have to pass a preliminary syntax test. Minor quoting
|
|
mistakes will issue a warning, all other invalid syntax issues
|
|
are errors.
|
|
After the syntax check passes, the URL is queued for connection
|
|
checking. All connection check types are described below.
|
|
.
|
|
.TP
|
|
HTTP links (\fBhttp:\fP, \fBhttps:\fP)
|
|
After connecting to the given HTTP server the given path
|
|
or query is requested. All redirections are followed, and
|
|
if user/password is given it will be used as authorization
|
|
when necessary.
|
|
All final HTTP status codes other than 2xx are errors.
|
|
.
|
|
HTML page contents are checked for recursion.
|
|
.TP
|
|
Local files (\fBfile:\fP)
|
|
A regular, readable file that can be opened is valid. A readable
|
|
directory is also valid. All other files, for example device files,
|
|
unreadable or non-existing files are errors.
|
|
.
|
|
HTML or other parseable file contents are checked for recursion.
|
|
.TP
|
|
Mail links (\fBmailto:\fP)
|
|
A mailto: link eventually resolves to a list of email addresses.
|
|
If one address fails, the whole list will fail.
|
|
For each mail address we check the following things:
|
|
.
|
|
1) Check the adress syntax, both of the part before and after
|
|
the @ sign.
|
|
2) Look up the MX DNS records. If we found no MX record,
|
|
print an error.
|
|
3) Check if one of the mail hosts accept an SMTP connection.
|
|
Check hosts with higher priority first.
|
|
If no host accepts SMTP, we print a warning.
|
|
4) Try to verify the address with the VRFY command. If we got
|
|
an answer, print the verified address as an info.
|
|
.TP
|
|
FTP links (\fBftp:\fP)
|
|
|
|
For FTP links we do:
|
|
|
|
1) connect to the specified host
|
|
2) try to login with the given user and password. The default
|
|
user is ``anonymous``, the default password is ``anonymous@``.
|
|
3) try to change to the given directory
|
|
4) list the file with the NLST command
|
|
|
|
.TP
|
|
Telnet links (``telnet:``)
|
|
|
|
We try to connect and if user/password are given, login to the
|
|
given telnet server.
|
|
|
|
.TP
|
|
NNTP links (``news:``, ``snews:``, ``nntp``)
|
|
|
|
We try to connect to the given NNTP server. If a news group or
|
|
article is specified, try to request it from the server.
|
|
|
|
.TP
|
|
Unsupported links (``javascript:``, etc.)
|
|
|
|
An unsupported link will only print a warning. No further checking
|
|
will be made.
|
|
|
|
The complete list of recognized, but unsupported links can be found
|
|
in the \fBlinkcheck/checker/unknownurl.py\fP source file.
|
|
The most prominent of them should be JavaScript links.
|
|
|
|
.SH PLUGINS
|
|
There are two plugin types: connection and content plugins.
|
|
.
|
|
Connection plugins are run after a successful connection to the
|
|
URL host.
|
|
.
|
|
Content plugins are run if the URL type has content
|
|
(mailto: URLs have no content for example) and if the check is not
|
|
forbidden (ie. by HTTP robots.txt).
|
|
.
|
|
See \fBlinkchecker \-\-list\-plugins\fP for a list of plugins and
|
|
their documentation. All plugins are enabled via the \fBlinkcheckerrc\fP(5)
|
|
configuration file.
|
|
|
|
.SH RECURSION
|
|
Before descending recursively into a URL, it has to fulfill several
|
|
conditions. They are checked in this order:
|
|
|
|
1. A URL must be valid.
|
|
|
|
2. A URL must be parseable. This currently includes HTML files,
|
|
Opera bookmarks files, and directories. If a file type cannot
|
|
be determined (for example it does not have a common HTML file
|
|
extension, and the content does not look like HTML), it is assumed
|
|
to be non-parseable.
|
|
|
|
3. The URL content must be retrievable. This is usually the case
|
|
except for example mailto: or unknown URL types.
|
|
|
|
4. The maximum recursion level must not be exceeded. It is configured
|
|
with the \fB\-\-recursion\-level\fP option and is unlimited per default.
|
|
|
|
5. It must not match the ignored URL list. This is controlled with
|
|
the \fB\-\-ignore\-url\fP option.
|
|
|
|
6. The Robots Exclusion Protocol must allow links in the URL to be
|
|
followed recursively. This is checked by searching for a
|
|
"nofollow" directive in the HTML header data.
|
|
|
|
Note that the directory recursion reads all files in that
|
|
directory, not just a subset like \fBindex.htm*\fP.
|
|
|
|
.SH NOTES
|
|
URLs on the commandline starting with \fBftp.\fP are treated like
|
|
\fBftp://ftp.\fP, URLs starting with \fBwww.\fP are treated like
|
|
\fBhttp://www.\fP.
|
|
You can also give local files as arguments.
|
|
|
|
If you have your system configured to automatically establish a
|
|
connection to the internet (e.g. with diald), it will connect when
|
|
checking links not pointing to your local host.
|
|
Use the \fB\-\-ignore\-url\fP option to prevent this.
|
|
|
|
Javascript links are not supported.
|
|
|
|
If your platform does not support threading, LinkChecker disables it
|
|
automatically.
|
|
|
|
You can supply multiple user/password pairs in a configuration file.
|
|
|
|
When checking \fBnews:\fP links the given NNTP host doesn't need to be the
|
|
same as the host of the user browsing your pages.
|
|
.
|
|
.SH ENVIRONMENT
|
|
\fBNNTP_SERVER\fP - specifies default NNTP server
|
|
.br
|
|
\fBhttp_proxy\fP - specifies default HTTP proxy server
|
|
.br
|
|
\fBftp_proxy\fP - specifies default FTP proxy server
|
|
.br
|
|
\fBno_proxy\fP - comma-separated list of domains to not contact over a proxy server
|
|
.br
|
|
\fBLC_MESSAGES\fP, \fBLANG\fP, \fBLANGUAGE\fP - specify output language
|
|
.
|
|
.SH RETURN VALUE
|
|
The return value is 2 when
|
|
.IP \(bu
|
|
a program error occurred.
|
|
.PP
|
|
The return value is 1 when
|
|
.IP \(bu
|
|
invalid links were found or
|
|
.IP \(bu
|
|
link warnings were found and warnings are enabled
|
|
.PP
|
|
Else the return value is zero.
|
|
.
|
|
.SH LIMITATIONS
|
|
LinkChecker consumes memory for each queued URL to check. With thousands
|
|
of queued URLs the amount of consumed memory can become quite large. This
|
|
might slow down the program or even the whole system.
|
|
.
|
|
.SH FILES
|
|
\fB~/.linkchecker/linkcheckerrc\fP - default configuration file
|
|
.br
|
|
\fB~/.linkchecker/blacklist\fP - default blacklist logger output filename
|
|
.br
|
|
\fBlinkchecker\-out.\fP\fITYPE\fP - default logger file output name
|
|
.br
|
|
\fBhttp://docs.python.org/library/codecs.html#standard-encodings\fP - valid output encodings
|
|
.br
|
|
\fBhttp://docs.python.org/howto/regex.html\fP - regular expression documentation
|
|
|
|
.SH "SEE ALSO"
|
|
\fBlinkcheckerrc\fP(5)
|
|
.
|
|
.SH AUTHOR
|
|
Bastian Kleineidam <bastian.kleineidam@web.de>
|
|
.
|
|
.SH COPYRIGHT
|
|
Copyright \(co 2000-2014 Bastian Kleineidam
|