<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<title>Check websites for broken links</title>
<link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
<link rel="stylesheet" href="pygments.css" type="text/css" />
<style type="text/css">
img { border: 0; }
</style>

</head>
<body>
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<table border="0"><tr>
<td><img
src="logo64x64.png" border="0" alt="LinkChecker"/></td>
<td><h1>LinkChecker</h1></td>
</tr></table>
</div>

<h1>Documentation</h1>

<h2>Basic usage</h2>

<p>To check a URL like <code>http://www.myhomepage.org/</code>, it is enough to
run <code>linkchecker http://www.myhomepage.org/</code>. This checks the
complete domain of <code>www.myhomepage.org</code> recursively. All links pointing
outside of the domain are also checked for validity.</p>
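<p>The split between links inside and outside the domain can be pictured with a
small Python sketch. It is only an illustration: the helper <code>is_internal</code>
is hypothetical, and comparing host names is an assumption about what "domain"
means here, not a description of LinkChecker's actual implementation.</p>

```python
from urllib.parse import urlsplit


def is_internal(link: str, start_url: str) -> bool:
    """Return True if link points to the same host as start_url."""
    # urlsplit() lowercases the host name, so the comparison is
    # case-insensitive. Same host is an assumed definition of "same domain".
    return urlsplit(link).hostname == urlsplit(start_url).hostname


start = "http://www.myhomepage.org/"
links = [
    "http://www.myhomepage.org/about.html",
    "http://www.example.com/",
]
for link in links:
    action = "check and recurse" if is_internal(link, start) else "check only"
    print(link, "->", action)
```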

<h2>Performed checks</h2>

<p>All URLs have to pass a preliminary syntax test. Minor quoting
mistakes issue a warning; all other syntax problems are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>
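<p>As a rough illustration of the warning/error distinction, the following Python
sketch classifies a URL with the standard <code>urllib.parse</code> module. The
function <code>syntax_check</code> and its rules are assumptions made for this
example (an unquoted space stands in for a "minor quoting mistake", a malformed
port for an error); the checks LinkChecker really performs may differ.</p>

```python
from urllib.parse import quote, urlsplit


def syntax_check(url: str) -> str:
    """Classify a URL as 'ok', 'warning' or 'error' (illustrative rules only)."""
    if not url.strip():
        return "error"  # an empty URL cannot be checked
    try:
        parts = urlsplit(url)
        _ = parts.port  # reading .port raises ValueError for a malformed port
    except ValueError:
        return "error"
    # Assumed example of a minor quoting mistake: the path contains
    # characters that should have been percent-encoded, such as a space.
    # The set of characters treated as safe is simplified for this sketch.
    if quote(parts.path, safe="/%:@!$'()*+,;=") != parts.path:
        return "warning"
    return "ok"


for url in ("http://www.myhomepage.org/",
            "http://www.myhomepage.org/my page.html",
            "http://www.myhomepage.org:port/"):
    print(url, "->", syntax_check(url))
```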

<ul>
<li><p>HTTP links (<code>http:</code>, <code>https:</code>)</p>

<p>After connecting to the given HTTP server, the given path
or query is requested. All redirections are followed, and
if a user and password are given, they are used for authorization
when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.</p></li>
<li><p>Local files (<code>file:</code>)</p>

<p>A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device files,
unreadable or nonexistent files, are errors.</p>

<p>File contents are checked for recursion.</p></li>
<li><p>Mail links (<code>mailto:</code>)</p>

<p>A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list fails.
For each mail address the following things are checked:</p>

<ol>
<li><p>Check the address syntax, both the part before and the part after
the @ sign.</p></li>
<li><p>Look up the MX DNS records. If no MX record is found,
print an error.</p></li>
<li><p>Check whether one of the mail hosts accepts an SMTP connection.
Hosts with higher priority are checked first.
If no host accepts SMTP, a warning is printed.</p></li>
<li><p>Try to verify the address with the VRFY command. If there is
an answer, the verified address is printed as an info.</p></li>
</ol></li>
<li><p>FTP links (<code>ftp:</code>)</p>

<p>For FTP links the following is checked:</p>

<ol>
<li><p>Connect to the specified host.</p></li>
<li><p>Try to log in with the given user and password. The default
user is <code>anonymous</code>, the default password is <code>anonymous@</code>.</p></li>
<li><p>Try to change to the given directory.</p></li>
<li><p>List the file with the NLST command.</p></li>
</ol></li>
<li><p>Telnet links (<code>telnet:</code>)</p>

<p>A connection to the given telnet server is attempted. If a user and
password are given, a login is attempted as well.</p></li>
<li><p>NNTP links (<code>news:</code>, <code>snews:</code>, <code>nntp:</code>)</p>

<p>A connection to the given NNTP server is attempted. If a news group or
article is specified, it is requested from the server.</p></li>
<li><p>Ignored links (<code>javascript:</code>, etc.)</p>

<p>An ignored link only prints a warning. No further checking
is done.</p>

<p>Here is the complete list of recognized but ignored links. The most
prominent of them are probably JavaScript links.</p>

<ul>
<li><code>acap:</code> (application configuration access protocol)</li>
<li><code>afs:</code> (Andrew File System global file names)</li>
<li><code>chrome:</code> (Mozilla specific)</li>
<li><code>cid:</code> (content identifier)</li>
<li><code>clsid:</code> (Microsoft specific)</li>
<li><code>data:</code> (data)</li>
<li><code>dav:</code> (dav)</li>
<li><code>fax:</code> (fax)</li>
<li><code>find:</code> (Mozilla specific)</li>
<li><code>gopher:</code> (Gopher)</li>
<li><code>imap:</code> (internet message access protocol)</li>
<li><code>isbn:</code> (ISBN (int. book numbers))</li>
<li><code>javascript:</code> (JavaScript)</li>
<li><code>ldap:</code> (Lightweight Directory Access Protocol)</li>
<li><code>mailserver:</code> (Access to data available from mail servers)</li>
<li><code>mid:</code> (message identifier)</li>
<li><code>mms:</code> (multimedia stream)</li>
<li><code>modem:</code> (modem)</li>
<li><code>nfs:</code> (network file system protocol)</li>
<li><code>opaquelocktoken:</code> (opaquelocktoken)</li>
<li><code>pop:</code> (Post Office Protocol v3)</li>
<li><code>prospero:</code> (Prospero Directory Service)</li>
<li><code>rsync:</code> (rsync protocol)</li>
<li><code>rtsp:</code> (real time streaming protocol)</li>
<li><code>service:</code> (service location)</li>
<li><code>shttp:</code> (secure HTTP)</li>
<li><code>sip:</code> (session initiation protocol)</li>
<li><code>tel:</code> (telephone)</li>
<li><code>tip:</code> (Transaction Internet Protocol)</li>
<li><code>tn3270:</code> (Interactive 3270 emulation sessions)</li>
<li><code>vemmi:</code> (versatile multimedia interface)</li>
<li><code>wais:</code> (Wide Area Information Servers)</li>
<li><code>z39.50r:</code> (Z39.50 Retrieval)</li>
<li><code>z39.50s:</code> (Z39.50 Session)</li>
</ul></li>
</ul>
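<p>The list above amounts to a dispatch on the URL scheme. A minimal Python
sketch of that idea follows; the names <code>CHECKED</code>, <code>IGNORED</code>
and <code>check_type</code> are hypothetical, the ignored set holds only a few
entries from the list, and the label for unknown schemes is an assumption.</p>

```python
from urllib.parse import urlsplit

# Schemes that get a dedicated connection check, taken from the list above.
CHECKED = {
    "http": "HTTP", "https": "HTTP",
    "file": "local file",
    "mailto": "mail",
    "ftp": "FTP",
    "telnet": "Telnet",
    "news": "NNTP", "snews": "NNTP", "nntp": "NNTP",
}
# A few of the recognized but ignored schemes (not the complete list).
IGNORED = {"javascript", "data", "gopher", "rsync", "ldap"}


def check_type(url: str) -> str:
    """Return a label for the kind of check a URL would receive."""
    scheme = urlsplit(url).scheme  # urlsplit() lowercases the scheme
    if scheme in CHECKED:
        return CHECKED[scheme]
    if scheme in IGNORED:
        return "ignored (warning only)"
    return "unknown"  # assumed label for schemes in neither table


for url in ("https://www.myhomepage.org/",
            "mailto:user@example.org",
            "javascript:void(0)"):
    print(url, "->", check_type(url))
```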

<h2>Recursion</h2>

<p>Before LinkChecker descends recursively into a URL, the URL has to fulfill
several conditions. They are checked in this order:</p>

<ol>
<li><p>The URL must be valid.</p></li>
<li><p>The URL must be parseable. This currently includes HTML files,
Opera bookmark files, and directories. If a file type cannot
be determined (for example, it does not have a common HTML file
extension and the content does not look like HTML), it is assumed
to be non-parseable.</p></li>
<li><p>The URL content must be retrievable. This is usually the case,
except for example for mailto: or unknown URL types.</p></li>
<li><p>The maximum recursion level must not be exceeded. It is configured
with the <code>--recursion-level</code> option and is unlimited by default.</p></li>
<li><p>The URL must not match the ignored URL list. This is controlled with
the <code>--ignore-url</code> option.</p></li>
<li><p>The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by searching for a
"nofollow" directive in the HTML header data.</p></li>
</ol>
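<p>The ordered conditions can be sketched in Python as a chain of early returns.
Everything here is hypothetical: the <code>UrlState</code> class, the
<code>recursion_blocker</code> function, and the convention that a negative
<code>max_level</code> stands for "unlimited" are assumptions made for the
example, not LinkChecker's internal API.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UrlState:
    """Hypothetical summary of what is known about an already checked URL."""
    valid: bool
    parseable: bool
    retrievable: bool
    level: int       # recursion depth at which the URL was found
    ignored: bool    # True if the URL matched an --ignore-url pattern
    nofollow: bool   # True if a "nofollow" directive was found


def recursion_blocker(state: UrlState, max_level: int = -1) -> Optional[str]:
    """Return the first failed condition, or None if recursion may proceed.

    In this sketch a negative max_level means "unlimited".
    """
    if not state.valid:
        return "invalid URL"
    if not state.parseable:
        return "not parseable"
    if not state.retrievable:
        return "content not retrievable"
    if max_level >= 0 and state.level > max_level:
        return "recursion level exceeded"
    if state.ignored:
        return "matches ignored URL list"
    if state.nofollow:
        return "nofollow directive"
    return None


page = UrlState(valid=True, parseable=True, retrievable=True,
                level=2, ignored=False, nofollow=False)
print(recursion_blocker(page))               # None: recursion allowed
print(recursion_blocker(page, max_level=1))  # recursion level exceeded
```

<p>Because the conditions are tested in order, only the first failing one is
reported, even if later ones would fail too.</p>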

<p>Note that directory recursion reads all files in that
directory, not just a subset like <code>index.htm*</code>.</p>
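<p>The difference between reading every file and reading only an
<code>index.htm*</code> subset can be shown with a short Python sketch. The file
names are made up for the example, and the code only lists a temporary
directory; it does not run LinkChecker.</p>

```python
import fnmatch
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    # Create three empty example files.
    for name in ("index.html", "about.html", "notes.txt"):
        with open(os.path.join(directory, name), "w") as handle:
            handle.write("")
    # Directory recursion considers every entry ...
    all_files = sorted(os.listdir(directory))
    # ... whereas an index-only approach would see just this subset.
    index_only = fnmatch.filter(all_files, "index.htm*")

print(all_files)   # ['about.html', 'index.html', 'notes.txt']
print(index_only)  # ['index.html']
```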
<div class="footer">
© Copyright 2010, Bastian Kleineidam.
</div>
</body>
</html>