mirror of
https://github.com/Hopiu/linkchecker.git
synced 2026-04-19 22:01:00 +00:00
526 lines
23 KiB
HTML
526 lines
23 KiB
HTML
<!DOCTYPE html>
|
|
<html>
|
|
<head>
|
|
<meta charset="utf-8"/>
|
|
<style>
|
|
table.head, table.foot { width: 100%; }
|
|
td.head-rtitle, td.foot-os { text-align: right; }
|
|
td.head-vol { text-align: center; }
|
|
div.Pp { margin: 1ex 0ex; }
|
|
div.Nd, div.Bf, div.Op { display: inline; }
|
|
span.Pa, span.Ad { font-style: italic; }
|
|
span.Ms { font-weight: bold; }
|
|
dl.Bl-diag > dt { font-weight: bold; }
|
|
code.Nm, code.Fl, code.Cm, code.Ic, code.In, code.Fd, code.Fn,
|
|
code.Cd { font-weight: bold; font-family: inherit; }
|
|
</style>
|
|
<title>LINKCHECKER(1)</title>
|
|
</head>
|
|
<body>
|
|
<table class="head">
|
|
<tr>
|
|
<td class="head-ltitle">LINKCHECKER(1)</td>
|
|
<td class="head-vol">LinkChecker User Manual</td>
|
|
<td class="head-rtitle">LINKCHECKER(1)</td>
|
|
</tr>
|
|
</table>
|
|
<div class="manual-text">
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="NAME"><a class="permalink" href="#NAME">NAME</a></h1>
|
|
linkchecker - command line client to check HTML documents and websites for
|
|
broken links
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="SYNOPSIS"><a class="permalink" href="#SYNOPSIS">SYNOPSIS</a></h1>
|
|
<b>linkchecker</b> [<i>options</i>] [<i>file-or-url</i>]...
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="DESCRIPTION"><a class="permalink" href="#DESCRIPTION">DESCRIPTION</a></h1>
|
|
<dl class="Bl-tag">
|
|
<dt>LinkChecker features</dt>
|
|
<dd></dd>
|
|
</dl>
|
|
<ul class="Bl-bullet">
|
|
<li>recursive and multithreaded checking,</li>
|
|
<li>output in colored or normal text, HTML, SQL, CSV, XML or a sitemap graph
|
|
in different formats,</li>
|
|
<li>support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local
|
|
file links,</li>
|
|
<li>restriction of link checking with URL filters,</li>
|
|
<li>proxy support,</li>
|
|
<li>username/password authorization for HTTP, FTP and Telnet,</li>
|
|
<li>support for robots.txt exclusion protocol,</li>
|
|
<li>support for Cookies</li>
|
|
<li>support for HTML5</li>
|
|
<li>HTML and CSS syntax check</li>
|
|
<li>Antivirus check</li>
|
|
<li>a command line and web interface</li>
|
|
</ul>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="EXAMPLES"><a class="permalink" href="#EXAMPLES">EXAMPLES</a></h1>
|
|
<dl class="Bl-tag">
|
|
<dt>The most common use checks the given domain recursively:</dt>
|
|
<dd><b>linkchecker http://www.example.com/</b>
|
|
<br/>
|
|
Beware that this checks the whole site which can have thousands of URLs. Use
|
|
the <b>-r</b> option to restrict the recursion depth.</dd>
|
|
<dt>Don't check URLs with <b>/secret</b> in its name. All other links are
|
|
checked as usual:</dt>
|
|
<dd><b>linkchecker --ignore-url=/secret mysite.example.com</b></dd>
|
|
<dt>Checking a local HTML file on Unix:</dt>
|
|
<dd><b>linkchecker ../bla.html</b></dd>
|
|
<dt>Checking a local HTML file on Windows:</dt>
|
|
<dd><b>linkchecker c:empest.html</b></dd>
|
|
<dt>You can skip the <b>http://</b> url part if the domain starts with
|
|
<b>www.</b>:</dt>
|
|
<dd><b>linkchecker www.example.com</b></dd>
|
|
<dt>You can skip the <b>ftp://</b> url part if the domain starts with
|
|
<b>ftp.</b>:</dt>
|
|
<dd><b>linkchecker -r0 ftp.example.com</b></dd>
|
|
<dt>Generate a sitemap graph and convert it with the graphviz dot
|
|
utility:</dt>
|
|
<dd><b>linkchecker -odot -v www.example.com | dot -Tps >
|
|
sitemap.ps</b></dd>
|
|
</dl>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="OPTIONS"><a class="permalink" href="#OPTIONS">OPTIONS</a></h1>
|
|
<section class="Ss">
|
|
<h2 class="Ss" id="General_options"><a class="permalink" href="#General_options">General
|
|
options</a></h2>
|
|
<dl class="Bl-tag">
|
|
<dt><b>-f</b><i>FILENAME</i>, <b>--config=</b><i>FILENAME</i></dt>
|
|
<dd>Use <i>FILENAME</i> as configuration file. As default LinkChecker uses
|
|
<b>~/.linkchecker/linkcheckerrc</b>.</dd>
|
|
<dt><b>-h</b>, <b>--help</b></dt>
|
|
<dd>Help me! Print usage information for this program.</dd>
|
|
<dt><b>--stdin</b></dt>
|
|
<dd>Read list of white-space separated URLs to check from stdin.</dd>
|
|
<dt><b>-t</b><i>NUMBER</i>, <b>--threads=</b><i>NUMBER</i></dt>
|
|
<dd>Generate no more than the given number of threads. Default number of
|
|
threads is 10. To disable threading specify a non-positive number.</dd>
|
|
<dt><b>-V</b>, <b>--version</b></dt>
|
|
<dd>Print version and exit.</dd>
|
|
<dt><b>--list-plugins</b></dt>
|
|
<dd>Print available check plugins and exit.</dd>
|
|
</dl>
|
|
</section>
|
|
<section class="Ss">
|
|
<h2 class="Ss" id="Output_options"><a class="permalink" href="#Output_options">Output
|
|
options</a></h2>
|
|
<dl class="Bl-tag">
|
|
<dt><b>-D</b><i>STRING</i>, <b>--debug=</b><i>STRING</i></dt>
|
|
<dd>Print debugging output for the given logger. Available loggers are
|
|
<b>cmdline</b>, <b>checking</b>, <b>cache</b>, <b>dns</b>, <b>plugin</b>
|
|
and <b>all</b>. Specifying <b>all</b> is an alias for specifying all
|
|
available loggers. The option can be given multiple times to debug with
|
|
more than one logger. For accurate results, threading will be disabled
|
|
during debug runs.</dd>
|
|
<dt><b>-F</b><i>TYPE</i>[<b>/</b><i>ENCODING</i>][<b>/</b><i>FILENAME</i>],
|
|
<b>--file-output=</b><i>TYPE</i>[<b>/</b><i>ENCODING</i>][<b>/</b><i>FILENAME</i>]</dt>
|
|
<dd>Output to a file <b>linkchecker-out.</b><i>TYPE</i>,
|
|
<b>$HOME/.linkchecker/blacklist</b> for <b>blacklist</b> output, or
|
|
<i>FILENAME</i> if specified. The <i>ENCODING</i> specifies the output
|
|
encoding, the default is that of your locale. Valid encodings are listed
|
|
at
|
|
<a class="Lk" href="https://docs.python.org/library/codecs.html#standard-encodings">https://docs.python.org/library/codecs.html#standard-encodings</a>.
|
|
<br/>
|
|
The <i>FILENAME</i> and <i>ENCODING</i> parts of the <b>none</b> output type
|
|
will be ignored, else if the file already exists, it will be overwritten.
|
|
You can specify this option more than once. Valid file output types are
|
|
<b>text</b>, <b>html</b>, <b>sql</b>, <b>csv</b>, <b>gml</b>, <b>dot</b>,
|
|
<b>xml</b>, <b>sitemap</b>, <b>none</b> or <b>blacklist</b>. Default is no
|
|
file output. The various output types are documented below. Note that you
|
|
can suppress all console output with the option <b>-o none</b>.</dd>
|
|
<dt><b>--no-status</b></dt>
|
|
<dd>Do not print check status messages.</dd>
|
|
<dt><b>--no-warnings</b></dt>
|
|
<dd>Don't log warnings. Default is to log warnings.</dd>
|
|
<dt><b>-o</b><i>TYPE</i>[<b>/</b><i>ENCODING</i>],
|
|
<b>--output=</b><i>TYPE</i>[<b>/</b><i>ENCODING</i>]</dt>
|
|
<dd>Specify output type as <b>text</b>, <b>html</b>, <b>sql</b>, <b>csv</b>,
|
|
<b>gml</b>, <b>dot</b>, <b>xml</b>, <b>sitemap</b>, <b>none</b> or
|
|
<b>blacklist</b>. Default type is <b>text</b>. The various output types
|
|
are documented below.
|
|
<br/>
|
|
The <i>ENCODING</i> specifies the output encoding, the default is that of
|
|
your locale. Valid encodings are listed at
|
|
<a class="Lk" href="https://docs.python.org/library/codecs.html#standard-encodings">https://docs.python.org/library/codecs.html#standard-encodings</a>.</dd>
|
|
<dt><b>-q</b>, <b>--quiet</b></dt>
|
|
<dd>Quiet operation, an alias for <b>-o none</b>. This is only useful with
|
|
<b>-F</b>.</dd>
|
|
<dt><b>-v</b>, <b>--verbose</b></dt>
|
|
<dd>Log all checked URLs. Default is to log only errors and warnings.</dd>
|
|
<dt><b>-W</b><i>REGEX</i>, <b>--warning-regex=</b><i>REGEX</i><b></b></dt>
|
|
<dd>Define a regular expression which prints a warning if it matches any
|
|
content of the checked link. This applies only to valid pages, so we can
|
|
get their content.
|
|
<br/>
|
|
Use this to check for pages that contain some form of error, for example
|
|
"This page has moved" or "Oracle Application error".
|
|
<br/>
|
|
Note that multiple values can be combined in the regular expression, for
|
|
example "(This page has moved|Oracle Application error)".
|
|
<br/>
|
|
See section <b>REGULAR EXPRESSIONS</b> for more info.</dd>
|
|
</dl>
|
|
</section>
|
|
<section class="Ss">
|
|
<h2 class="Ss" id="Checking_options"><a class="permalink" href="#Checking_options">Checking
|
|
options</a></h2>
|
|
<dl class="Bl-tag">
|
|
<dt><b>--cookiefile=</b><i>FILENAME</i></dt>
|
|
<dd>Read a file with initial cookie data. The cookie data format is explained
|
|
below.</dd>
|
|
<dt><b>--check-extern</b></dt>
|
|
<dd>Check external URLs.</dd>
|
|
<dt><b>--ignore-url=</b><i>REGEX</i></dt>
|
|
<dd>URLs matching the given regular expression will be ignored and not
|
|
checked.
|
|
<br/>
|
|
This option can be given multiple times.
|
|
<br/>
|
|
See section <b>REGULAR EXPRESSIONS</b> for more info.</dd>
|
|
<dt><b>-N</b><i>STRING</i>, <b>--nntp-server=</b><i>STRING</i></dt>
|
|
<dd>Specify an NNTP server for <b>news:</b> links. Default is the environment
|
|
variable <b>NNTP_SERVER</b>. If no host is given, only the syntax of the
|
|
link is checked.</dd>
|
|
<dt><b>--no-follow-url=</b><i>REGEX</i></dt>
|
|
<dd>Check but do not recurse into URLs matching the given regular expression.
|
|
<br/>
|
|
This option can be given multiple times.
|
|
<br/>
|
|
See section <b>REGULAR EXPRESSIONS</b> for more info.</dd>
|
|
<dt><b>-p</b>, <b>--password</b></dt>
|
|
<dd>Read a password from console and use it for HTTP and FTP authorization.
|
|
For FTP the default password is <b>anonymous@</b>. For HTTP there is no
|
|
default password. See also <b>-u</b>.</dd>
|
|
<dt><b>-r</b><i>NUMBER</i>, <b>--recursion-level=</b><i>NUMBER</i></dt>
|
|
<dd>Check recursively all links up to given depth. A negative depth will
|
|
enable infinite recursion. Default depth is infinite.</dd>
|
|
<dt><b>--timeout=</b><i>NUMBER</i></dt>
|
|
<dd>Set the timeout for connection attempts in seconds. The default timeout is
|
|
60 seconds.</dd>
|
|
<dt><b>-u</b><i>STRING</i>, <b>--user=</b><i>STRING</i></dt>
|
|
<dd>Try the given username for HTTP and FTP authorization. For FTP the default
|
|
username is <b>anonymous</b>. For HTTP there is no default username. See
|
|
also <b>-p</b>.</dd>
|
|
<dt><b>--user-agent=</b><i>STRING</i></dt>
|
|
<dd>Specify the User-Agent string to send to the HTTP server, for example
|
|
"Mozilla/4.0". The default is "LinkChecker/X.Y" where
|
|
X.Y is the current version of LinkChecker.
|
|
<p class="Pp"></p>
|
|
</dd>
|
|
</dl>
|
|
</section>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="CONFIGURATION_FILES"><a class="permalink" href="#CONFIGURATION_FILES">CONFIGURATION
|
|
FILES</a></h1>
|
|
Configuration files can specify all options above. They can also specify some
|
|
options that cannot be set on the command line. See <a href="../man5/linkcheckerrc.5.html" class="Xr">linkcheckerrc(5)</a>
|
|
for more info.
|
|
<p class="Pp"></p>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="OUTPUT_TYPES"><a class="permalink" href="#OUTPUT_TYPES">OUTPUT
|
|
TYPES</a></h1>
|
|
Note that by default only errors and warnings are logged. You should use the
|
|
<b>--verbose</b> option to get the complete URL list, especially when
|
|
outputting a sitemap graph format.
|
|
<p class="Pp"></p>
|
|
<dl class="Bl-tag">
|
|
<dt><b>text</b></dt>
|
|
<dd>Standard text logger, logging URLs in keyword: argument fashion.</dd>
|
|
<dt><b>html</b></dt>
|
|
<dd>Log URLs in keyword: argument fashion, formatted as HTML. Additionally has
|
|
links to the referenced pages. Invalid URLs have HTML and CSS syntax check
|
|
links appended.</dd>
|
|
<dt><b>csv</b></dt>
|
|
<dd>Log check result in CSV format with one URL per line.</dd>
|
|
<dt><b>gml</b></dt>
|
|
<dd>Log parent-child relations between linked URLs as a GML sitemap
|
|
graph.</dd>
|
|
<dt><b>dot</b></dt>
|
|
<dd>Log parent-child relations between linked URLs as a DOT sitemap
|
|
graph.</dd>
|
|
<dt><b>gxml</b></dt>
|
|
<dd>Log check result as a GraphXML sitemap graph.</dd>
|
|
<dt><b>xml</b></dt>
|
|
<dd>Log check result as machine-readable XML.</dd>
|
|
<dt><b>sitemap</b></dt>
|
|
<dd>Log check result as an XML sitemap whose protocol is documented at
|
|
<a class="Lk" href="https://www.sitemaps.org/protocol.html">https://www.sitemaps.org/protocol.html</a>.</dd>
|
|
<dt><b>sql</b></dt>
|
|
<dd>Log check result as SQL script with INSERT commands. An example script to
|
|
create the initial SQL table is included as create.sql.</dd>
|
|
<dt><b>blacklist</b></dt>
|
|
<dd>Suitable for cron jobs. Logs the check result into a file
|
|
<b>~/.linkchecker/blacklist</b> which only contains entries with invalid
|
|
URLs and the number of times they have failed.</dd>
|
|
<dt><b>none</b></dt>
|
|
<dd>Logs nothing. Suitable for debugging or checking the exit code.</dd>
|
|
</dl>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="REGULAR_EXPRESSIONS"><a class="permalink" href="#REGULAR_EXPRESSIONS">REGULAR
|
|
EXPRESSIONS</a></h1>
|
|
LinkChecker accepts Python regular expressions. See
|
|
<a class="Lk" href="https://docs.python.org/howto/regex.html">https://docs.python.org/howto/regex.html</a>
|
|
for an introduction.
|
|
<p class="Pp">An addition is that a leading exclamation mark negates the regular
|
|
expression.</p>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="COOKIE_FILES"><a class="permalink" href="#COOKIE_FILES">COOKIE
|
|
FILES</a></h1>
|
|
A cookie file contains standard HTTP header (RFC 2616) data with the following
|
|
possible names:
|
|
<dl class="Bl-tag">
|
|
<dt><b>Host</b> (required)</dt>
|
|
<dd>Sets the domain the cookies are valid for.</dd>
|
|
<dt><b>Path</b> (optional)</dt>
|
|
<dd>Gives the path the cookies are value for; default path is <b>/</b>.</dd>
|
|
<dt><b>Set-cookie</b> (required)</dt>
|
|
<dd>Set cookie name/value. Can be given more than once.</dd>
|
|
</dl>
|
|
<p class="Pp">Multiple entries are separated by a blank line. The example below
|
|
will send two cookies to all URLs starting with
|
|
<b>http://example.com/hello/</b> and one to all URLs starting with
|
|
<b>https://example.org/</b>:</p>
|
|
<pre>
|
|
Host: example.com
|
|
Path: /hello
|
|
Set-cookie: ID="smee"
|
|
Set-cookie: spam="egg"
|
|
</pre>
|
|
<pre>
|
|
Host: example.org
|
|
Set-cookie: baggage="elitist"; comment="hologram"
|
|
</pre>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="PROXY_SUPPORT"><a class="permalink" href="#PROXY_SUPPORT">PROXY
|
|
SUPPORT</a></h1>
|
|
To use a proxy on Unix or Windows set the $http_proxy, $https_proxy or
|
|
$ftp_proxy environment variables to the proxy URL. The URL should be of the
|
|
form
|
|
<b>http://</b>[<i>user</i><b>:</b><i>pass</i><b>@</b>]<i>host</i>[<b>:</b><i>port</i>].
|
|
LinkChecker also detects manual proxy settings of Internet Explorer under
|
|
Windows systems, and gconf or KDE on Linux systems. On a Mac use the Internet
|
|
Config to select a proxy.
|
|
<p class="Pp">You can also set a comma-separated domain list in the $no_proxy
|
|
environment variables to ignore any proxy settings for these domains.</p>
|
|
<dl class="Bl-tag">
|
|
<dt>Setting a HTTP proxy on Unix for example looks like this:</dt>
|
|
<dd><b>export http_proxy="http://proxy.example.com:8080"</b></dd>
|
|
<dt>Proxy authentication is also supported:</dt>
|
|
<dd><b>export
|
|
http_proxy="http://user1:mypass@proxy.example.org:8081"</b></dd>
|
|
<dt>Setting a proxy on the Windows command prompt:</dt>
|
|
<dd><b>set http_proxy=http://proxy.example.com:8080</b></dd>
|
|
</dl>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="PERFORMED_CHECKS"><a class="permalink" href="#PERFORMED_CHECKS">PERFORMED
|
|
CHECKS</a></h1>
|
|
All URLs have to pass a preliminary syntax test. Minor quoting mistakes will
|
|
issue a warning, all other invalid syntax issues are errors. After the syntax
|
|
check passes, the URL is queued for connection checking. All connection check
|
|
types are described below.
|
|
<dl class="Bl-tag">
|
|
<dt>HTTP links (<b>http:</b>, <b>https:</b>)</dt>
|
|
<dd>After connecting to the given HTTP server the given path or query is
|
|
requested. All redirections are followed, and if user/password is given it
|
|
will be used as authorization when necessary. All final HTTP status codes
|
|
other than 2xx are errors.</dd>
|
|
</dl>
|
|
<dl class="Bl-tag">
|
|
<dt></dt>
|
|
<dd>HTML page contents are checked for recursion.</dd>
|
|
</dl>
|
|
<dl class="Bl-tag">
|
|
<dt>Local files (<b>file:</b>)</dt>
|
|
<dd>A regular, readable file that can be opened is valid. A readable directory
|
|
is also valid. All other files, for example device files, unreadable or
|
|
non-existing files are errors.</dd>
|
|
</dl>
|
|
<dl class="Bl-tag">
|
|
<dt></dt>
|
|
<dd>HTML or other parseable file contents are checked for recursion.</dd>
|
|
</dl>
|
|
<dl class="Bl-tag">
|
|
<dt>Mail links (<b>mailto:</b>)</dt>
|
|
<dd>A mailto: link eventually resolves to a list of email addresses. If one
|
|
address fails, the whole list will fail. For each mail address we check
|
|
the following things:
|
|
<br/>
|
|
1) Check the adress syntax, both of the part before and after the @ sign.
|
|
<br/>
|
|
2) Look up the MX DNS records. If we found no MX record, print an error.
|
|
<br/>
|
|
3) Check if one of the mail hosts accept an SMTP connection. Check hosts
|
|
with higher priority first. If no host accepts SMTP, we print a warning.
|
|
<br/>
|
|
4) Try to verify the address with the VRFY command. If we got an answer,
|
|
print the verified address as an info.
|
|
<p class="Pp"></p>
|
|
</dd>
|
|
<dt>FTP links (<b>ftp:</b>)</dt>
|
|
<dd>For FTP links we do:
|
|
<br/>
|
|
1) connect to the specified host
|
|
<br/>
|
|
2) try to login with the given user and password. The default user is
|
|
<b>anonymous</b>, the default password is <b>anonymous@</b>.
|
|
<br/>
|
|
3) try to change to the given directory
|
|
<br/>
|
|
4) list the file with the NLST command
|
|
<p class="Pp"></p>
|
|
</dd>
|
|
<dt>Telnet links (<b>telnet:</b>)</dt>
|
|
<dd>We try to connect and if user/password are given, login to the given
|
|
telnet server.
|
|
<p class="Pp"></p>
|
|
</dd>
|
|
<dt>NNTP links (<b>news:</b>, <b>snews:</b>, <b>nntp</b>)</dt>
|
|
<dd>We try to connect to the given NNTP server. If a news group or article is
|
|
specified, try to request it from the server.
|
|
<p class="Pp"></p>
|
|
</dd>
|
|
<dt>Unsupported links (<b>javascript:</b>, etc.)</dt>
|
|
<dd>An unsupported link will only print a warning. No further checking will be
|
|
made.</dd>
|
|
</dl>
|
|
<dl class="Bl-tag">
|
|
<dt></dt>
|
|
<dd>The complete list of recognized, but unsupported links can be found in the
|
|
<a class="Lk" href="https://github.com/linkchecker/linkchecker/blob/master/linkcheck/checker/unknownurl.py">linkcheck/checker/unknownurl.py</a>
|
|
source file. The most prominent of them should be JavaScript links.</dd>
|
|
</dl>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="PLUGINS"><a class="permalink" href="#PLUGINS">PLUGINS</a></h1>
|
|
There are two plugin types: connection and content plugins. Connection plugins
|
|
are run after a successful connection to the URL host. Content plugins are run
|
|
if the URL type has content (mailto: URLs have no content for example) and if
|
|
the check is not forbidden (ie. by HTTP robots.txt).
|
|
<p class="Pp">See <b>linkchecker --list-plugins</b> for a list of plugins and
|
|
their documentation. All plugins are enabled via the <a href="../man5/linkcheckerrc.5.html" class="Xr">linkcheckerrc(5)</a>
|
|
configuration file.</p>
|
|
<p class="Pp"></p>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="RECURSION"><a class="permalink" href="#RECURSION">RECURSION</a></h1>
|
|
Before descending recursively into a URL, it has to fulfill several conditions.
|
|
They are checked in this order:
|
|
<p class="Pp">1. A URL must be valid.</p>
|
|
<p class="Pp">2. A URL must be parseable. This currently includes HTML files,
|
|
Opera bookmarks files, and directories. If a file type cannot
|
|
be determined (for example it does not have a common HTML file
|
|
extension, and the content does not look like HTML), it is assumed
|
|
to be non-parseable.</p>
|
|
<p class="Pp">3. The URL content must be retrievable. This is usually the case
|
|
except for example mailto: or unknown URL types.</p>
|
|
<p class="Pp">4. The maximum recursion level must not be exceeded. It is
|
|
configured
|
|
with the <b>--recursion-level</b> option and is unlimited per default.</p>
|
|
<p class="Pp">5. It must not match the ignored URL list. This is controlled with
|
|
the <b>--ignore-url</b> option.</p>
|
|
<p class="Pp">6. The Robots Exclusion Protocol must allow links in the URL to be
|
|
followed recursively. This is checked by searching for a
|
|
"nofollow" directive in the HTML header data.</p>
|
|
<p class="Pp">Note that the directory recursion reads all files in that
|
|
directory, not just a subset like <b>index.htm*</b>.</p>
|
|
<p class="Pp"></p>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="NOTES"><a class="permalink" href="#NOTES">NOTES</a></h1>
|
|
URLs on the commandline starting with <b>ftp.</b> are treated like
|
|
<b>ftp://ftp.</b>, URLs starting with <b>www.</b> are treated like
|
|
<b>http://www.</b>. You can also give local files as arguments.
|
|
<p class="Pp">If you have your system configured to automatically establish a
|
|
connection to the internet (e.g. with diald), it will connect when checking
|
|
links not pointing to your local host. Use the <b>--ignore-url</b> option to
|
|
prevent this.</p>
|
|
<p class="Pp">Javascript links are not supported.</p>
|
|
<p class="Pp">If your platform does not support threading, LinkChecker disables
|
|
it automatically.</p>
|
|
<p class="Pp">You can supply multiple user/password pairs in a configuration
|
|
file.</p>
|
|
<p class="Pp">When checking <b>news:</b> links the given NNTP host doesn't need
|
|
to be the same as the host of the user browsing your pages.</p>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="ENVIRONMENT"><a class="permalink" href="#ENVIRONMENT">ENVIRONMENT</a></h1>
|
|
<b>NNTP_SERVER</b> - specifies default NNTP server
|
|
<br/>
|
|
<b>http_proxy</b> - specifies default HTTP proxy server
|
|
<br/>
|
|
<b>ftp_proxy</b> - specifies default FTP proxy server
|
|
<br/>
|
|
<b>no_proxy</b> - comma-separated list of domains to not contact over a proxy
|
|
server
|
|
<br/>
|
|
<b>LC_MESSAGES</b>, <b>LANG</b>, <b>LANGUAGE</b> - specify output language
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="RETURN_VALUE"><a class="permalink" href="#RETURN_VALUE">RETURN
|
|
VALUE</a></h1>
|
|
The return value is 2 when
|
|
<dl class="Bl-tag">
|
|
<dt>•</dt>
|
|
<dd>a program error occurred.</dd>
|
|
</dl>
|
|
<p class="Pp">The return value is 1 when</p>
|
|
<ul class="Bl-bullet">
|
|
<li>invalid links were found or</li>
|
|
<li>link warnings were found and warnings are enabled</li>
|
|
</ul>
|
|
<p class="Pp">Else the return value is zero.</p>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="LIMITATIONS"><a class="permalink" href="#LIMITATIONS">LIMITATIONS</a></h1>
|
|
LinkChecker consumes memory for each queued URL to check. With thousands of
|
|
queued URLs the amount of consumed memory can become quite large. This might
|
|
slow down the program or even the whole system.
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="FILES"><a class="permalink" href="#FILES">FILES</a></h1>
|
|
<b>~/.linkchecker/linkcheckerrc</b> - default configuration file
|
|
<br/>
|
|
<b>~/.linkchecker/blacklist</b> - default blacklist logger output filename
|
|
<br/>
|
|
<b>linkchecker-out.</b><i>TYPE</i> - default logger file output name
|
|
<br/>
|
|
<a class="Lk" href="https://docs.python.org/library/codecs.html#standard-encodings">https://docs.python.org/library/codecs.html#standard-encodings</a>
|
|
- valid output encodings
|
|
<br/>
|
|
<a class="Lk" href="https://docs.python.org/howto/regex.html">https://docs.python.org/howto/regex.html</a>
|
|
- regular expression documentation
|
|
<p class="Pp"></p>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="SEE_ALSO"><a class="permalink" href="#SEE_ALSO">SEE
|
|
ALSO</a></h1>
|
|
<a href="../man5/linkcheckerrc.5.html" class="Xr">linkcheckerrc(5)</a>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="AUTHOR"><a class="permalink" href="#AUTHOR">AUTHOR</a></h1>
|
|
Bastian Kleineidam <bastian.kleineidam@web.de>
|
|
</section>
|
|
<section class="Sh">
|
|
<h1 class="Sh" id="COPYRIGHT"><a class="permalink" href="#COPYRIGHT">COPYRIGHT</a></h1>
|
|
Copyright © 2000-2014 Bastian Kleineidam
|
|
</section>
|
|
</div>
|
|
<table class="foot">
|
|
<tr>
|
|
<td class="foot-date">2020-04-24</td>
|
|
<td class="foot-os">LinkChecker</td>
|
|
</tr>
|
|
</table>
|
|
</body>
|
|
</html>
|