
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Check websites for broken links</title>
<link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
<link rel="stylesheet" href="pygments.css" type="text/css" />
<style type="text/css">
img { border: 0; }
</style>
</head>
<body>
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<table border="0"><tr>
<td><img
src="logo64x64.png" border="0" alt="LinkChecker"/></td>
<td><h1>LinkChecker</h1></td>
</tr></table>
</div>
<h1>Documentation</h1>
<h2>Basic usage</h2>
<p>To check a URL like <a href="http://www.example.org/">http://www.example.org/</a>
it is enough to
type <code>linkchecker www.example.org</code> on the command line or
type <code>www.example.org</code> in the GUI application. This will check the
complete domain of <code>http://www.example.org</code> recursively. All links
pointing outside of the domain are also checked for validity.</p>
<p>Local files can also be checked. On Unix or OS X systems the syntax
is <code>file:///path/to/my/file.html</code>. On Windows the syntax is
<code>file://C|/path/to/my/file.html</code>. When directories are checked,
all files they contain are checked as well.</p>
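<p>As an illustration, a well-formed <code>file:</code> URL can be built from an
absolute Unix path with Python's standard library (a sketch for clarity, not
part of LinkChecker itself):</p>

```python
from pathlib import PurePosixPath

# Build a file: URL from an absolute Unix path; the path is illustrative.
url = PurePosixPath("/path/to/my/file.html").as_uri()
print(url)  # file:///path/to/my/file.html
```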
<p>In the GUI client the <code>Edit</code> menu has shortcuts for bookmark
files. For example if Google Chrome is installed, there will be
a menu entry called <code>Insert Google Chrome bookmark file</code> which
can be used to check all browser bookmarks.</p>
<h2>Options</h2>
<p>The command line client options are documented
in the <a href="http://wummel.github.io/linkchecker/man1/linkchecker.1.html">linkchecker(1) manual page</a>.</p>
<p>In the GUI client, the following options are available:</p>
<ul>
<li><p>Recursive depth</p>
<p>Check all links recursively up to the given depth.
A negative depth (e.g. <code>-1</code>) enables infinite recursion.</p></li>
<li><p>Verbose output</p>
<p>If set, log all checked URLs. Default is to log only errors and warnings.</p></li>
<li><p>Debug</p>
<p>Prints debugging output in a separate window which can be seen with
<code>Help -&gt; Show debug</code>.</p></li>
<li><p>Debug memory usage</p>
<p>Profiles memory usage and writes statistics and a dump file when checking
stops. The dump file can be examined with
<a href="https://github.com/wummel/linkchecker/blob/master/tests/analyze_memdump.py">external tools</a>.
This option is only useful for developers.</p></li>
<li><p>Warning strings</p>
<p>Log a warning if any of the given strings are found in the content of the
checked URL. Strings are entered one per line.</p>
<p>Use this to check for pages that contain some form of error, for example
"<code>This page has moved</code>" or "<code>Oracle Application error</code>".</p></li>
<li><p>Ignoring URLs</p>
<p>URLs matching the given regular expressions will be ignored and not checked.
This is useful if certain URL types should not be checked, such as email
addresses (e.g. "<code>^mailto:</code>").</p></li>
</ul>
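<p>The ignore option described above amounts to matching each URL against a
list of regular expressions. A minimal sketch in Python (the patterns shown
are examples, not defaults):</p>

```python
import re

# Example ignore patterns, one per line as they would be entered in the GUI.
patterns = [r"^mailto:", r"^javascript:"]
compiled = [re.compile(p) for p in patterns]

def is_ignored(url):
    """Return True if the URL matches any ignore pattern."""
    return any(p.search(url) for p in compiled)

print(is_ignored("mailto:user@example.org"))   # True
print(is_ignored("http://www.example.org/"))   # False
```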
<h2>Configuration file</h2>
<p>Each user can edit a configuration file with advanced options for
checking or filtering. The
<a href="http://wummel.github.io/linkchecker/man5/linkcheckerrc.5.html">linkcheckerrc(5) manual page</a>
documents all the options.</p>
<p>In the GUI client the configuration file can be edited directly from
the dialog <code>Edit -&gt; Options</code> and then clicking on <code>Edit</code>.</p>
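<p>The configuration file uses an INI-style format that can be read with
Python's <code>configparser</code>. The snippet below is a hypothetical
example; see the linkcheckerrc(5) manual page for the actual section and
option names:</p>

```python
import configparser

# A hypothetical linkcheckerrc-style snippet; section and option names
# are modeled on the linkcheckerrc(5) manual page, not copied from it.
text = """
[checking]
recursionlevel=5

[filtering]
ignore=
  ^mailto:
"""
cfg = configparser.ConfigParser()
cfg.read_string(text)
print(cfg.getint("checking", "recursionlevel"))  # 5
```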
<h2>Performed checks</h2>
<p>All URLs have to pass a preliminary syntax test.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>
<ul>
<li><p>HTTP links (<code>http:</code>, <code>https:</code>)</p>
<p>After connecting to the given HTTP server, the given path
or query is requested. All redirections are followed, and
if a user name and password are given, they are used for
authorization when necessary.
Permanently moved pages (status code 301) issue a warning.
All final HTTP status codes other than 2xx are errors.</p>
<p>For HTTPS links, the SSL certificate is checked against the
given hostname. If it does not match, a warning is printed.</p></li>
<li><p>Local files (<code>file:</code>)</p>
<p>A regular file that can be opened is valid. A readable
directory is also valid. All other files, for example unreadable
or non-existent files and device files, are errors.</p>
<p>File contents are checked for recursion: if a file is parseable
(for example an HTML file), all links it contains are
checked.</p></li>
<li><p>Mail links (<code>mailto:</code>)</p>
<p>A mailto: link resolves to a list of email addresses.
If any email address fails, the whole list fails.
For each email address the following things are checked:</p>
<ol>
<li>Check the address syntax, both the part before and after
the @ sign.</li>
<li>Look up the MX DNS records. If no MX record is found,
print an error.</li>
<li>Check whether one of the MX mail hosts accepts an SMTP connection.
Hosts with higher priority are checked first.
If none of the hosts accepts SMTP, a warning is printed.</li>
<li>Try to verify the address with the VRFY command. If there is
an answer, the verified address is printed as an info message.</li>
</ol></li>
<li><p>FTP links (<code>ftp:</code>)</p>
<p>For FTP links the following is checked:</p>
<ol>
<li>Connect to the specified host.</li>
<li>Try to login with the given user and password. The default
user is <code>anonymous</code>, the default password is <code>anonymous@</code>.</li>
<li>Try to change to the given directory.</li>
<li>List the file with the NLST command.</li>
</ol></li>
<li><p>Telnet links (<code>telnet:</code>)</p>
<p>A connection is attempted and, if a user name and password are
given, a login to the given Telnet server is tried.</p></li>
<li><p>NNTP links (<code>news:</code>, <code>snews:</code>, <code>nntp:</code>)</p>
<p>A connection to the given NNTP server is attempted. If a news group or
article is specified, it is requested from the server.</p></li>
<li><p>Unsupported links (<code>javascript:</code>, etc.)</p>
<p>An unsupported link prints a warning, but no error. No further checking
is done.</p>
<p>The complete list of recognized but unsupported links can be seen in the
<a href="https://github.com/wummel/linkchecker/blob/master/linkcheck/checker/unknownurl.py">unknownurl.py</a>
source file. The most prominent of them are JavaScript links.</p></li>
</ul>
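<p>The HTTP result handling described above can be sketched as a small
classifier. This only illustrates the stated rules (final 2xx is valid, a 301
along the way adds a warning, everything else is an error) and is not
LinkChecker's actual implementation:</p>

```python
def classify_http_result(final_code, permanent_redirect=False):
    """Classify a final HTTP status code per the rules described above.

    A final 2xx code is valid (with a warning if a 301 permanent
    redirect was followed along the way); anything else is an error.
    """
    if 200 <= final_code < 300:
        return "valid, warning" if permanent_redirect else "valid"
    return "error"

print(classify_http_result(200))         # valid
print(classify_http_result(200, True))   # valid, warning
print(classify_http_result(404))         # error
```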
<h2>Recursion</h2>
<p>Before a URL is checked recursively, it has to fulfill several
conditions, which are checked in this order:</p>
<ol>
<li>The URL must be valid.</li>
<li>The URL must be parseable. Parseable content currently includes
HTML files, bookmark files (Opera, Chrome or Safari), directories
and, on Windows systems, MS Word files if Word and the Pywin32
module are installed.
If a file type cannot be determined (for example, it does not have
a common HTML file extension and the content does not look like
HTML), it is assumed to be non-parseable.</li>
<li>The URL content must be retrievable. This is usually the case,
except for example for <code>mailto:</code> or unknown URL types.</li>
<li>The maximum recursion level must not be exceeded. It is configured
with the <code>--recursion-level</code> command line option, the recursion
level GUI option, or through the configuration file.
The recursion level is unlimited by default.</li>
<li>It must not match the ignored URL list. This is controlled with
the <code>--ignore-url</code> command line option or through the
configuration file.</li>
<li>The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by evaluating the server's
robots.txt file and searching for a "nofollow" directive in the
HTML header data.</li>
</ol>
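<p>The robots.txt part of the last condition can be illustrated with Python's
standard <code>urllib.robotparser</code> module (a sketch; LinkChecker's own
implementation may differ):</p>

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly; in practice the file would be
# fetched from http://host/robots.txt before checking links on that host.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("LinkChecker", "http://www.example.org/private/a.html"))  # False
print(rp.can_fetch("LinkChecker", "http://www.example.org/index.html"))      # True
```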
<p>Note that local and FTP directory recursion reads all files in a
directory, not just a subset such as <code>index.htm*</code>.</p>
<hr/>
<div class="footer">
&copy; Copyright 2012-2014, Bastian Kleineidam.
</div>
</body>
</html>