<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<title>Check websites for broken links</title>
<link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
<link rel="stylesheet" href="pygments.css" type="text/css" />
<style type="text/css">
img { border: 0; }
</style>

</head>
<body>
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<table border="0"><tr>
<td><img
src="logo64x64.png" border="0" alt="LinkChecker"/></td>
<td><h1>LinkChecker</h1></td>
</tr></table>
</div>

<h1>Documentation</h1>

<h2>Basic usage</h2>

<p>To check a URL like <a href="http://www.example.org/">http://www.example.org/</a>,
it is enough to
type <code>linkchecker www.example.org</code> on the command line or
type <code>www.example.org</code> in the GUI application. This will check the
complete domain of <code>http://www.example.org</code> recursively. All links
pointing outside of the domain are also checked for validity.</p>

<p>Local files can also be checked. On Unix or OS X systems the syntax
is <code>file:///path/to/my/file.html</code>. On Windows the syntax is
<code>file://C|/path/to/my/file.html</code>. When a directory is checked,
all files it contains are checked as well.</p>
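<p>As a minimal sketch of the Unix syntax above, the following Python snippet
builds a <code>file:</code> URL from an absolute path. The helper name
<code>to_file_url</code> is hypothetical and not part of LinkChecker; the
Windows <code>C|</code> form is not covered here.</p>

<pre>
```python
from urllib.parse import quote


def to_file_url(path: str) -> str:
    """Build a file: URL from an absolute Unix-style path (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path, got %r" % path)
    # quote() percent-encodes characters such as spaces but keeps "/".
    return "file://" + quote(path)


print(to_file_url("/path/to/my/file.html"))
print(to_file_url("/home/user/my site/index.html"))
```
</pre>

<p>The resulting string can then be passed to <code>linkchecker</code> in place
of a web address.</p>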

<p>In the GUI client, the <code>Edit</code> menu has shortcuts for bookmark
files. For example, if Google Chrome is installed, there will be
a menu entry called <code>Insert Google Chrome bookmark file</code> which
can be used to check all browser bookmarks.</p>

<h2>Options</h2>

<p>The command-line client options are documented
in the <a href="http://wummel.github.io/linkchecker/man1/linkchecker.1.html">linkchecker(1) manual page</a>.</p>

<p>In the GUI client, the following options are available:</p>

<ul>
<li><p>Recursive depth</p>

<p>Recursively check all links up to the given depth.
A negative depth (e.g. <code>-1</code>) enables infinite recursion.</p></li>
<li><p>Verbose output</p>

<p>If set, log all checked URLs. The default is to log only errors and warnings.</p></li>
<li><p>Debug</p>

<p>Prints debugging output in a separate window, which can be opened with
<code>Help -> Show debug</code>.</p></li>
<li><p>Debug memory usage</p>

<p>Profiles memory usage and writes statistics and a dump file when checking
stops. The dump file can be examined with
<a href="https://github.com/wummel/linkchecker/blob/master/tests/analyze_memdump.py">external tools</a>.
This option is only useful for developers.</p></li>
<li><p>Warning strings</p>

<p>Log a warning if any of the given strings is found in the content of the
checked URL. Strings are entered one per line.</p>

<p>Use this to check for pages that contain some form of error, for example
"<code>This page has moved</code>" or "<code>Oracle Application error</code>".</p></li>
<li><p>Ignoring URLs</p>

<p>URLs matching the given regular expressions will be ignored and not checked.
This is useful if certain URL types should not be checked, such as email links (e.g.
"<code>^mailto:</code>").</p></li>
</ul>
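<p>The "Warning strings" and "Ignoring URLs" options above can be illustrated
with a short Python sketch. It is not LinkChecker's implementation: the function
names, the extra <code>\.pdf$</code> pattern and the use of
<code>re.search</code> are assumptions made for the example.</p>

<pre>
```python
import re

# Hypothetical values, as they might be entered in the GUI fields.
IGNORE_PATTERNS = [r"^mailto:", r"\.pdf$"]
WARNING_STRINGS = ["This page has moved", "Oracle Application error"]


def is_ignored(url: str) -> bool:
    """Return True if any ignore pattern is found in the URL."""
    # Assumption: patterns are searched anywhere in the URL, so "^" is
    # needed to anchor a pattern at the start.
    return any(re.search(pattern, url) for pattern in IGNORE_PATTERNS)


def find_warnings(content: str) -> list:
    """Return the warning strings that occur in the page content."""
    return [text for text in WARNING_STRINGS if text in content]


print(is_ignored("mailto:user@example.org"))
print(find_warnings("Sorry. This page has moved."))
```
</pre>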

<h2>Configuration file</h2>

<p>Each user can edit a configuration file with advanced options for
checking or filtering. The
<a href="http://wummel.github.io/linkchecker/man5/linkcheckerrc.5.html">linkcheckerrc(5) manual page</a>
documents all the options.</p>

<p>In the GUI client, the configuration file can be edited directly by opening
the dialog <code>Edit -> Options</code> and then clicking on <code>Edit</code>.</p>

<h2>Performed checks</h2>

<p>All URLs have to pass a preliminary syntax test.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>

<ul>
<li><p>HTTP links (<code>http:</code>, <code>https:</code>)</p>

<p>After connecting to the given HTTP server, the given path
or query is requested. All redirections are followed, and
if a user/password is given, it will be used as authorization
when necessary.
Permanently moved pages (status code 301) cause a warning.
All final HTTP status codes other than 2xx are errors.</p>

<p>For HTTPS links, the SSL certificate is checked against the
given hostname. If it does not match, a warning is printed.</p></li>
<li><p>Local files (<code>file:</code>)</p>

<p>A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example unreadable,
non-existing, or device files, are errors.</p>

<p>File contents are checked for recursion. If they are parseable
files (for example HTML files), all links in that file will be
checked.</p></li>
<li><p>Mail links (<code>mailto:</code>)</p>

<p>A mailto: link resolves to a list of email addresses.
If one email address fails, the whole list fails.
For each email address the following things are checked:</p>

<ol>
<li>Check the address syntax, both the part before and the part after
the @ sign.</li>
<li>Look up the MX DNS records. If no MX record is found,
print an error.</li>
<li>Check if one of the MX mail hosts accepts an SMTP connection.
Check hosts with higher priority first.
If none of the hosts accepts SMTP, a warning is printed.</li>
<li>Try to verify the address with the VRFY command. If there is
an answer, the verified address is printed as an info message.</li>
</ol></li>
<li><p>FTP links (<code>ftp:</code>)</p>

<p>For FTP links the following is checked:</p>

<ol>
<li>Connect to the specified host.</li>
<li>Try to log in with the given user and password. The default
user is <code>anonymous</code>, the default password is <code>anonymous@</code>.</li>
<li>Try to change to the given directory.</li>
<li>List the file with the NLST command.</li>
</ol></li>
<li><p>Telnet links (<code>telnet:</code>)</p>

<p>A connection to the given telnet server is attempted and, if
user/password are given, a login is tried.</p></li>
<li><p>NNTP links (<code>news:</code>, <code>snews:</code>, <code>nntp:</code>)</p>

<p>A connection to the given NNTP server is attempted. If a news group or
article is specified, it will be requested from the server.</p></li>
<li><p>Unsupported links (<code>javascript:</code>, etc.)</p>

<p>An unsupported link will print a warning, but no error. No further checking
will be made.</p>

<p>The complete list of recognized but unsupported links can be seen in the
<a href="https://github.com/wummel/linkchecker/blob/master/linkcheck/checker/unknownurl.py">unknownurl.py</a>
source file. The most prominent of them are JavaScript links.</p></li>
</ul>
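<p>The HTTP rules in the list above (a 301 anywhere in the redirect chain gives
a warning, and only a final 2xx status is valid) can be sketched in Python as
follows. The function name, the input format (a list of status codes with the
final one last) and the result strings are assumptions made for illustration;
this is not LinkChecker's own code.</p>

<pre>
```python
def classify_http_result(status_codes):
    """Classify a chain of HTTP status codes; the final code comes last.

    Returns a (result, warnings) tuple where result is "valid" or "error".
    """
    warnings = []
    # A permanent redirect anywhere in the chain yields a warning.
    if 301 in status_codes:
        warnings.append("permanently moved (301)")
    final = status_codes[-1]
    # Only a final 2xx status counts as valid.
    result = "valid" if final // 100 == 2 else "error"
    return result, warnings


print(classify_http_result([200]))
print(classify_http_result([301, 200]))
print(classify_http_result([302, 404]))
```
</pre>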

<h2>Recursion</h2>

<p>Before descending recursively into a URL, it has to fulfill several
conditions. The conditions are checked in this order:</p>

<ol>
<li>The URL must be valid.</li>
<li>The URL must be parseable. This currently includes HTML files,
bookmark files (Opera, Chrome or Safari), directories and, on
Windows systems, MS Word files if Word and the Pywin32 module
are installed on your system.
If a file type cannot be determined (for example it does not have
a common HTML file extension, and the content does not look like
HTML), it is assumed to be non-parseable.</li>
<li>The URL content must be retrievable. This is usually the case,
except for example for mailto: or unknown URL types.</li>
<li>The maximum recursion level must not be exceeded. It is configured
with the <code>--recursion-level</code> command line option, the recursion
level GUI option, or through the configuration file.
The recursion level is unlimited by default.</li>
<li>It must not match the ignored URL list. This is controlled with
the <code>--ignore-url</code> command line option or through the
configuration file.</li>
<li>The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by evaluating the server's
robots.txt file and searching for a "nofollow" directive in the
HTML header data.</li>
</ol>

<p>Note that the local and FTP directory recursion reads all files in that
directory, not just a subset like <code>index.htm*</code>.</p>
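<p>The ordered conditions above can be summarized in a small Python sketch.
The <code>UrlInfo</code> class, its fields and the exact level comparison are
hypothetical and only mirror the list; a negative maximum level stands for
unlimited recursion, as described for the recursive depth option.</p>

<pre>
```python
from dataclasses import dataclass


@dataclass
class UrlInfo:
    """Hypothetical summary of one checked URL."""
    valid: bool
    parseable: bool
    retrievable: bool
    level: int           # depth at which the URL was found
    ignored: bool        # matched the ignored URL list
    robots_allow: bool   # robots.txt and "nofollow" both permit following


def may_recurse(info: UrlInfo, max_level: int = -1) -> bool:
    """Apply the recursion conditions in the documented order."""
    if not info.valid:
        return False
    if not info.parseable:
        return False
    if not info.retrievable:
        return False
    # Assumption: a URL found at the maximum level is checked but not descended into.
    if max_level >= 0 and info.level >= max_level:
        return False
    if info.ignored:
        return False
    return info.robots_allow


page = UrlInfo(valid=True, parseable=True, retrievable=True,
               level=1, ignored=False, robots_allow=True)
print(may_recurse(page))
print(may_recurse(page, max_level=1))
```
</pre>

<p>Checking the cheap conditions first means that later, more expensive ones
(such as fetching robots.txt) are only evaluated when still needed.</p>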
<hr/>
<div class="footer">
© Copyright 2012-2014, Bastian Kleineidam.
</div>
</body>
</html>