<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<title>Check websites for broken links</title>
<link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
<link rel="stylesheet" href="pygments.css" type="text/css" />
<style type="text/css">
img { border: 0; }
</style>

</head>
<body>
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<table border="0"><tr>
<td><img
 src="logo64x64.png" border="0" alt="LinkChecker"/></td>
<td><h1>LinkChecker</h1></td>
</tr></table>
</div>

<h1>Documentation</h1>

<h2>Basic usage</h2>

<p>To check a URL like <code>http://www.example.org/</code> it is enough to
type <code>linkchecker www.example.org</code> on the command line or
type <code>www.example.org</code> in the GUI application. This will check the
complete domain of <code>http://www.example.org</code> recursively. All links
pointing outside of the domain are also checked for validity.</p>

<h2>Performed checks</h2>

<p>All URLs have to pass a preliminary syntax test. Minor quoting
mistakes issue a warning; all other syntax problems are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>

<ul>
<li><p>HTTP links (<code>http:</code>, <code>https:</code>)</p>

<p>After connecting to the given HTTP server, the given path
or query is requested. All redirections are followed, and
if a user and password are given they are used for authorization
when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.</p></li>
<li><p>Local files (<code>file:</code>)</p>

<p>A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example unreadable,
non-existent or device files, are errors.</p>

<p>File contents are checked for recursion. If a file is parseable
(for example an HTML file), all links in that file are
checked.</p></li>
<li><p>Mail links (<code>mailto:</code>)</p>

<p>A mailto: link resolves to a list of email addresses.
If one address fails, the whole list fails.
For each mail address the following things are checked:</p>

<ol>
<li>Check the address syntax, both the part before and the part after
the @ sign.</li>
<li>Look up the MX DNS records. If no MX record is found,
print an error.</li>
<li>Check if one of the MX mail hosts accepts an SMTP connection.
Check hosts with higher priority first.
If none of the hosts accepts SMTP, a warning is printed.</li>
<li>Try to verify the address with the VRFY command. If there is
an answer, the verified address is printed as an info.</li>
</ol></li>
<li><p>FTP links (<code>ftp:</code>)</p>

<p>For FTP links the following is checked:</p>

<ol>
<li>Connect to the specified host.</li>
<li>Try to log in with the given user and password. The default
user is <code>anonymous</code>, the default password is <code>anonymous@</code>.</li>
<li>Try to change to the given directory.</li>
<li>List the file with the NLST command.</li>
</ol></li>
<li><p>Telnet links (<code>telnet:</code>)</p>

<p>A connection to the given telnet server is attempted and, if a
user and password are given, a login is tried.</p></li>
<li><p>NNTP links (<code>news:</code>, <code>snews:</code>, <code>nntp:</code>)</p>

<p>A connection to the given NNTP server is attempted. If a news group or
article is specified, it will be requested from the server.</p></li>
<li><p>Ignored links (<code>javascript:</code>, etc.)</p>

<p>An ignored link produces a warning, but no error. No further checking
is done.</p>

<p>Here is the complete list of recognized, but ignored links. The most
prominent of them are JavaScript links.</p>

<ul>
<li><code>acap:</code> (application configuration access protocol)</li>
<li><code>afs:</code> (Andrew File System global file names)</li>
<li><code>chrome:</code> (Mozilla specific)</li>
<li><code>cid:</code> (content identifier)</li>
<li><code>clsid:</code> (Microsoft specific)</li>
<li><code>data:</code> (data)</li>
<li><code>dav:</code> (dav)</li>
<li><code>fax:</code> (fax)</li>
<li><code>find:</code> (Mozilla specific)</li>
<li><code>gopher:</code> (Gopher)</li>
<li><code>imap:</code> (internet message access protocol)</li>
<li><code>irc:</code> (internet relay chat)</li>
<li><code>isbn:</code> (ISBN, International Standard Book Numbers)</li>
<li><code>javascript:</code> (JavaScript)</li>
<li><code>ldap:</code> (Lightweight Directory Access Protocol)</li>
<li><code>mailserver:</code> (Access to data available from mail servers)</li>
<li><code>mid:</code> (message identifier)</li>
<li><code>mms:</code> (multimedia stream)</li>
<li><code>modem:</code> (modem)</li>
<li><code>nfs:</code> (network file system protocol)</li>
<li><code>opaquelocktoken:</code> (opaquelocktoken)</li>
<li><code>pop:</code> (Post Office Protocol v3)</li>
<li><code>prospero:</code> (Prospero Directory Service)</li>
<li><code>rsync:</code> (rsync protocol)</li>
<li><code>rtsp:</code> (real time streaming protocol)</li>
<li><code>service:</code> (service location)</li>
<li><code>shttp:</code> (secure HTTP)</li>
<li><code>sip:</code> (session initiation protocol)</li>
<li><code>tel:</code> (telephone)</li>
<li><code>tip:</code> (Transaction Internet Protocol)</li>
<li><code>tn3270:</code> (Interactive 3270 emulation sessions)</li>
<li><code>vemmi:</code> (versatile multimedia interface)</li>
<li><code>wais:</code> (Wide Area Information Servers)</li>
<li><code>z39.50r:</code> (Z39.50 Retrieval)</li>
<li><code>z39.50s:</code> (Z39.50 Session)</li>
</ul></li>
</ul>
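<p>The mapping from URL scheme to check type described above can be sketched
in Python. This is a minimal illustration and not LinkChecker's own code: the
names <code>CHECKED_SCHEMES</code>, <code>IGNORED_SCHEMES</code>,
<code>classify_url</code> and <code>http_verdict</code> are hypothetical, and
reporting a URL without a listed scheme as <code>unknown</code> is an assumption
of this sketch.</p>

```python
from urllib.parse import urlsplit

# Schemes that get a connection check, mapped to the check type (names are illustrative).
CHECKED_SCHEMES = {
    "http": "http", "https": "http",
    "file": "file",
    "mailto": "mail",
    "ftp": "ftp",
    "telnet": "telnet",
    "news": "nntp", "snews": "nntp", "nntp": "nntp",
}

# Recognized but ignored schemes, copied from the list above.
IGNORED_SCHEMES = frozenset("""
    acap afs chrome cid clsid data dav fax find gopher imap irc isbn
    javascript ldap mailserver mid mms modem nfs opaquelocktoken pop
    prospero rsync rtsp service shttp sip tel tip tn3270 vemmi wais
    z39.50r z39.50s
""".split())


def classify_url(url):
    """Return the check type for a URL, 'ignored', or 'unknown'."""
    scheme = urlsplit(url).scheme.lower()
    if scheme in CHECKED_SCHEMES:
        return CHECKED_SCHEMES[scheme]
    if scheme in IGNORED_SCHEMES:
        return "ignored"  # warning only, no further checking
    return "unknown"      # assumption: anything else is left to the caller


def http_verdict(final_status, permanently_moved=False):
    """Apply the HTTP rules: a final non-2xx status is an error, a permanent move a warning."""
    if final_status // 100 != 2:
        return "error"
    return "warning" if permanently_moved else "ok"
```

<p>For example, <code>classify_url("javascript:void(0)")</code> yields
<code>ignored</code>, and <code>http_verdict(404)</code> yields
<code>error</code>.</p>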

<h2>Recursion</h2>

<p>Before descending recursively into a URL, it has to fulfill several
conditions. The conditions are checked in this order:</p>

<ol>
<li>The URL must be valid.</li>
<li>The URL must be parseable. This currently includes HTML files,
Opera bookmarks files, directories and, on Windows systems, MS Word
files if Word is installed on your system. If a file type cannot
be determined (for example it does not have a common HTML file
extension, and the content does not look like HTML), it is assumed
to be non-parseable.</li>
<li>The URL content must be retrievable. This is usually the case,
except for example for mailto: or unknown URL types.</li>
<li>The maximum recursion level must not be exceeded. It is configured
with the <code>--recursion-level</code> command line option, the recursion
level GUI option, or through the configuration file.
The recursion level is unlimited by default.</li>
<li>It must not match the ignored URL list. This is controlled with
the <code>--ignore-url</code> command line option or through the
configuration file.</li>
<li>The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by evaluating the server's
robots.txt file and searching for a "nofollow" directive in the
HTML header data.</li>
</ol>

<p>Note that recursion into local and FTP directories reads all files in the
directory, not just a subset like <code>index.htm*</code>.</p>
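<p>The ordered conditions above can be sketched as a single Python function.
This is a minimal illustration under stated assumptions, not LinkChecker's
implementation: the function names, the keyword arguments, the
<code>LinkChecker</code> user agent string, treating ignore entries as regular
expressions, and the exact level comparison are all hypothetical. The
robots.txt content is passed in as text, so the sketch needs no network
access.</p>

```python
import re
from urllib.robotparser import RobotFileParser


def robots_allow(robots_txt, url, user_agent="LinkChecker"):
    """Evaluate already downloaded robots.txt content for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def recursion_blocker(url, valid, parseable, retrievable, level,
                      max_level=None, ignore_patterns=(),
                      robots_txt="", nofollow=False):
    """Return the first unmet condition as a string, or None if recursion is allowed.

    The checks run in the same order as the list above.
    """
    if not valid:
        return "invalid"
    if not parseable:
        return "not parseable"
    if not retrievable:
        return "not retrievable"
    # Assumption: max_level=None means unlimited, the documented default.
    if max_level is not None and level >= max_level:
        return "recursion level"
    # Assumption: each ignore entry is a regular expression searched in the URL.
    if any(re.search(pattern, url) for pattern in ignore_patterns):
        return "ignored"
    # nofollow stands for a "nofollow" directive found in the HTML header data.
    if nofollow or not robots_allow(robots_txt, url):
        return "robots"
    return None
```

<p>Returning the name of the first unmet condition, rather than a plain
boolean, makes the order of the checks visible to the caller.</p>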

<h2>Configuration file</h2>

<p>Each user can edit a configuration file with advanced options for
checking or filtering.</p>

<p>On Unix or OS X systems the user configuration file is at</p>

<ul>
<li><code>~/.linkchecker/linkcheckerrc</code></li>
</ul>

<p>On Windows the user configuration file is at</p>

<ul>
<li><code>%HOMEPATH%\.linkchecker\linkcheckerrc</code></li>
</ul>
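<p>A short Python sketch of how these two locations could be computed. The
function name <code>user_config_path</code> and its arguments are hypothetical,
and the sketch assumes that <code>~</code> corresponds to the
<code>HOME</code> environment variable on Unix and OS X.</p>

```python
import ntpath
import posixpath


def user_config_path(platform, environ):
    """Build the per-user configuration file path for the given platform.

    platform is a value like sys.platform, environ a mapping like os.environ.
    """
    if platform.startswith("win"):
        # Windows: %HOMEPATH%\.linkchecker\linkcheckerrc
        home = environ.get("HOMEPATH", "")
        return ntpath.join(home, ".linkchecker", "linkcheckerrc")
    # Unix or OS X: ~/.linkchecker/linkcheckerrc (assumes ~ is $HOME)
    home = environ.get("HOME", "")
    return posixpath.join(home, ".linkchecker", "linkcheckerrc")
```

<p>A typical call would be <code>user_config_path(sys.platform, os.environ)</code>.
Passing the platform and environment explicitly keeps the function easy to
test on any system.</p>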
<hr/>
<div class="footer">
© Copyright 2011, Bastian Kleineidam.
</div>
</body>
</html>