Updated HTML documentation

This commit is contained in:
Bastian Kleineidam 2009-08-19 20:41:17 +02:00
parent a92e13826b
commit fb091c377f
6 changed files with 280 additions and 160 deletions

@@ -6,7 +6,10 @@ all: $(HELPFILES)
clean:
-rm -f *.qhc *.qch
%.qhc: %.qhcp lcdoc.qhp
%.html: %.txt html.header html.footer
(cat html.header; markdown2 $<; cat html.footer) > index.html
%.qhc: %.qhcp lcdoc.qhp index.html
qcollectiongenerator $< -o $@
favicon.ico: favicon32x32.png favicon16x16.png

doc/html/html.footer Normal file

@@ -0,0 +1,5 @@
<div class="footer">
&copy; Copyright 2009, Bastian Kleineidam.
</div>
</body>
</html>

doc/html/html.header Normal file

@@ -0,0 +1,24 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Check websites for broken links</title>
<link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
<link rel="stylesheet" href="pygments.css" type="text/css" />
<style type="text/css">
img { border: 0; }
</style>
</head>
<body>
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<table border="0"><tr>
<td><img
src="logo64x64.png" border="0" alt="LinkChecker"/></td>
<td><h1>LinkChecker</h1></td>
</tr></table>
</div>

doc/html/index.html

@@ -8,7 +8,6 @@
<title>Check websites for broken links</title>
<link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
<link rel="stylesheet" href="pygments.css" type="text/css" />
<link rel="top" title="LinkChecker" href="" />
<style type="text/css">
img { border: 0; }
</style>
@@ -16,206 +15,149 @@ img { border: 0; }
</head>
<body>
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<table border="0"><tr>
<td><a href=""><img
src="logo64x64.png" border="0" alt="LinkChecker"/></a></td>
<td><img
src="logo64x64.png" border="0" alt="LinkChecker"/></td>
<td><h1>LinkChecker</h1></td>
</tr></table>
</div>
<div class="document">
<div class="documentwrapper">
<div class="body">
<div class="section" id="check-websites-for-broken-links">
<p>To check a URL like <tt class="docutils literal"><span class="pre">http://www.myhomepage.org/</span></tt> it is enough to
execute <tt class="docutils literal"><span class="pre">linkchecker</span> <span class="pre">http://www.myhomepage.org/</span></tt>. This will check the
<h1>Documentation</h1>
<h2>Basic usage</h2>
<p>To check a URL like <code>http://www.myhomepage.org/</code> it is enough to
execute <code>linkchecker http://www.myhomepage.org/</code>. This will check the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are also checked for validity.</p>
</div>
<div class="section" id="performed-checks">
<h2>Performed checks</h2>
<p>All URLs have to pass a preliminary syntax test. Minor quoting
mistakes will issue a warning; all other invalid syntax issues
are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>
<ul>
<li><p class="first">HTTP links (<tt class="docutils literal"><span class="pre">http:</span></tt>, <tt class="docutils literal"><span class="pre">https:</span></tt>)</p>
<li><p>HTTP links (<code>http:</code>, <code>https:</code>)</p>
<p>After connecting to the given HTTP server the given path
or query is requested. All redirections are followed, and
if user/password is given it will be used as authorization
when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.</p>
</li>
<li><p class="first">Local files (<tt class="docutils literal"><span class="pre">file:</span></tt>)</p>
All final HTTP status codes other than 2xx are errors.</p></li>
<li><p>Local files (<code>file:</code>)</p>
<p>A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device files,
unreadable or non-existing files, are errors.</p>
<p>File contents are checked for recursion.</p>
</li>
<li><p class="first">Mail links (<tt class="docutils literal"><span class="pre">mailto:</span></tt>)</p>
<p>File contents are checked for recursion.</p></li>
<li><p>Mail links (<code>mailto:</code>)</p>
<p>A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list will fail.
For each mail address we check the following things:</p>
<ol class="arabic simple">
<li>Check the adress syntax, both of the part before and after
the &#64; sign.</li>
<li>Look up the MX DNS records. If we found no MX record,
print an error.</li>
<li>Check if one of the mail hosts accept an SMTP connection.
Check hosts with higher priority first.
If no host accepts SMTP, we print a warning.</li>
<li>Try to verify the address with the VRFY command. If we got
an answer, print the verified address as an info.</li>
</ol>
</li>
<li><p class="first">FTP links (<tt class="docutils literal"><span class="pre">ftp:</span></tt>)</p>
<p>1) Check the adress syntax, both of the part before and after
the @ sign.
2) Look up the MX DNS records. If no MX record is found,
print an error.
3) Check if one of the mail hosts accepts an SMTP connection.
Check hosts with higher priority first.
If no host accepts SMTP, we print a warning.
4) Try to verify the address with the VRFY command. If we get
an answer, print the verified address as an info message.</p></li>
<li><p>FTP links (<code>ftp:</code>)</p>
<p>For FTP links we do:</p>
<ol class="arabic simple">
<li>connect to the specified host</li>
<li>try to login with the given user and password. The default
user is <tt class="docutils literal"><span class="pre">anonymous</span></tt>, the default password is <tt class="docutils literal"><span class="pre">anonymous&#64;</span></tt>.</li>
<li>try to change to the given directory</li>
<li>list the file with the NLST command</li>
</ol>
</li>
<li><p class="first">Telnet links (<tt class="docutils literal"><span class="pre">telnet:</span></tt>)</p>
<p>1) connect to the specified host
2) try to log in with the given user and password. The default
user is <code>anonymous</code>, the default password is <code>anonymous@</code>.
3) try to change to the given directory
4) list the file with the NLST command</p></li>
<li><p>Telnet links (<code>telnet:</code>)</p>
<p>We try to connect and if user/password are given, login to the
given telnet server.</p>
</li>
<li><p class="first">NNTP links (<tt class="docutils literal"><span class="pre">news:</span></tt>, <tt class="docutils literal"><span class="pre">snews:</span></tt>, <tt class="docutils literal"><span class="pre">nntp</span></tt>)</p>
given telnet server.</p></li>
<li><p>NNTP links (<code>news:</code>, <code>snews:</code>, <code>nntp:</code>)</p>
<p>We try to connect to the given NNTP server. If a news group or
article is specified, try to request it from the server.</p>
</li>
<li><p class="first">Ignored links (<tt class="docutils literal"><span class="pre">javascript:</span></tt>, etc.)</p>
article is specified, try to request it from the server.</p></li>
<li><p>Ignored links (<code>javascript:</code>, etc.)</p>
<p>An ignored link will only print a warning. No further checking
will be made.</p>
<p>Here is a complete list of recognized but ignored links. The
most prominent of these are JavaScript links.</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">acap:</span></tt> (application configuration access protocol)</li>
<li><tt class="docutils literal"><span class="pre">afs:</span></tt> (Andrew File System global file names)</li>
<li><tt class="docutils literal"><span class="pre">chrome:</span></tt> (Mozilla specific)</li>
<li><tt class="docutils literal"><span class="pre">cid:</span></tt> (content identifier)</li>
<li><tt class="docutils literal"><span class="pre">clsid:</span></tt> (Microsoft specific)</li>
<li><tt class="docutils literal"><span class="pre">data:</span></tt> (data)</li>
<li><tt class="docutils literal"><span class="pre">dav:</span></tt> (dav)</li>
<li><tt class="docutils literal"><span class="pre">fax:</span></tt> (fax)</li>
<li><tt class="docutils literal"><span class="pre">find:</span></tt> (Mozilla specific)</li>
<li><tt class="docutils literal"><span class="pre">gopher:</span></tt> (Gopher)</li>
<li><tt class="docutils literal"><span class="pre">imap:</span></tt> (internet message access protocol)</li>
<li><tt class="docutils literal"><span class="pre">isbn:</span></tt> (ISBN (int. book numbers))</li>
<li><tt class="docutils literal"><span class="pre">javascript:</span></tt> (JavaScript)</li>
<li><tt class="docutils literal"><span class="pre">ldap:</span></tt> (Lightweight Directory Access Protocol)</li>
<li><tt class="docutils literal"><span class="pre">mailserver:</span></tt> (Access to data available from mail servers)</li>
<li><tt class="docutils literal"><span class="pre">mid:</span></tt> (message identifier)</li>
<li><tt class="docutils literal"><span class="pre">mms:</span></tt> (multimedia stream)</li>
<li><tt class="docutils literal"><span class="pre">modem:</span></tt> (modem)</li>
<li><tt class="docutils literal"><span class="pre">nfs:</span></tt> (network file system protocol)</li>
<li><tt class="docutils literal"><span class="pre">opaquelocktoken:</span></tt> (opaquelocktoken)</li>
<li><tt class="docutils literal"><span class="pre">pop:</span></tt> (Post Office Protocol v3)</li>
<li><tt class="docutils literal"><span class="pre">prospero:</span></tt> (Prospero Directory Service)</li>
<li><tt class="docutils literal"><span class="pre">rsync:</span></tt> (rsync protocol)</li>
<li><tt class="docutils literal"><span class="pre">rtsp:</span></tt> (real time streaming protocol)</li>
<li><tt class="docutils literal"><span class="pre">service:</span></tt> (service location)</li>
<li><tt class="docutils literal"><span class="pre">shttp:</span></tt> (secure HTTP)</li>
<li><tt class="docutils literal"><span class="pre">sip:</span></tt> (session initiation protocol)</li>
<li><tt class="docutils literal"><span class="pre">tel:</span></tt> (telephone)</li>
<li><tt class="docutils literal"><span class="pre">tip:</span></tt> (Transaction Internet Protocol)</li>
<li><tt class="docutils literal"><span class="pre">tn3270:</span></tt> (Interactive 3270 emulation sessions)</li>
<li><tt class="docutils literal"><span class="pre">vemmi:</span></tt> (versatile multimedia interface)</li>
<li><tt class="docutils literal"><span class="pre">wais:</span></tt> (Wide Area Information Servers)</li>
<li><tt class="docutils literal"><span class="pre">z39.50r:</span></tt> (Z39.50 Retrieval)</li>
<li><tt class="docutils literal"><span class="pre">z39.50s:</span></tt> (Z39.50 Session)</li>
<ul>
<li><code>acap:</code> (application configuration access protocol)</li>
<li><code>afs:</code> (Andrew File System global file names)</li>
<li><code>chrome:</code> (Mozilla specific)</li>
<li><code>cid:</code> (content identifier)</li>
<li><code>clsid:</code> (Microsoft specific)</li>
<li><code>data:</code> (data)</li>
<li><code>dav:</code> (dav)</li>
<li><code>fax:</code> (fax)</li>
<li><code>find:</code> (Mozilla specific)</li>
<li><code>gopher:</code> (Gopher)</li>
<li><code>imap:</code> (internet message access protocol)</li>
<li><code>isbn:</code> (ISBN (int. book numbers))</li>
<li><code>javascript:</code> (JavaScript)</li>
<li><code>ldap:</code> (Lightweight Directory Access Protocol)</li>
<li><code>mailserver:</code> (Access to data available from mail servers)</li>
<li><code>mid:</code> (message identifier)</li>
<li><code>mms:</code> (multimedia stream)</li>
<li><code>modem:</code> (modem)</li>
<li><code>nfs:</code> (network file system protocol)</li>
<li><code>opaquelocktoken:</code> (opaquelocktoken)</li>
<li><code>pop:</code> (Post Office Protocol v3)</li>
<li><code>prospero:</code> (Prospero Directory Service)</li>
<li><code>rsync:</code> (rsync protocol)</li>
<li><code>rtsp:</code> (real time streaming protocol)</li>
<li><code>service:</code> (service location)</li>
<li><code>shttp:</code> (secure HTTP)</li>
<li><code>sip:</code> (session initiation protocol)</li>
<li><code>tel:</code> (telephone)</li>
<li><code>tip:</code> (Transaction Internet Protocol)</li>
<li><code>tn3270:</code> (Interactive 3270 emulation sessions)</li>
<li><code>vemmi:</code> (versatile multimedia interface)</li>
<li><code>wais:</code> (Wide Area Information Servers)</li>
<li><code>z39.50r:</code> (Z39.50 Retrieval)</li>
<li><code>z39.50s:</code> (Z39.50 Session)</li>
</ul></li>
</ul>
</li>
</ul>
</div>
<div class="section" id="recursion">
<h2>Recursion</h2>
<p>Before descending recursively into a URL, it has to fulfill several
conditions. They are checked in this order:</p>
<ol class="arabic simple">
<li>A URL must be valid.</li>
<li>A URL must be parseable. This currently includes HTML files,
<ol>
<li><p>A URL must be valid.</p></li>
<li><p>A URL must be parseable. This currently includes HTML files,
Opera bookmarks files, and directories. If a file type cannot
be determined (for example it does not have a common HTML file
extension, and the content does not look like HTML), it is assumed
to be non-parseable.</li>
<li>The URL content must be retrievable. This is usually the case
except for example mailto: or unknown URL types.</li>
<li>The maximum recursion level must not be exceeded. It is configured
with the <tt class="docutils literal"><span class="pre">--recursion-level</span></tt> option and is unlimited per default.</li>
<li>It must not match the ignored URL list. This is controlled with
the <tt class="docutils literal"><span class="pre">--ignore-url</span></tt> option.</li>
<li>The Robots Exclusion Protocol must allow links in the URL to be
to be non-parseable.</p></li>
<li><p>The URL content must be retrievable. This is usually the case
except for example mailto: or unknown URL types.</p></li>
<li><p>The maximum recursion level must not be exceeded. It is configured
with the <code>--recursion-level</code> option and is unlimited by default.</p></li>
<li><p>It must not match the ignored URL list. This is controlled with
the <code>--ignore-url</code> option.</p></li>
<li><p>The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by searching for a
&#8220;nofollow&#8221; directive in the HTML header data.</li>
"nofollow" directive in the HTML header data.</p></li>
</ol>
<p>Note that the directory recursion reads all files in that
directory, not just a subset like <tt class="docutils literal"><span class="pre">index.htm*</span></tt>.</p>
</div>
<div class="section" id="frequently-asked-questions">
<h2>Frequently asked questions</h2>
<p><strong>Q: LinkChecker produced an error, but my web page is ok with
Mozilla/IE/Opera/...
Is this a bug in LinkChecker?</strong></p>
<p>A: Please check your web pages first. Are they really ok?
Use the <tt class="docutils literal"><span class="pre">--check-html</span></tt> option, or check if you are using a proxy
which produces the error.</p>
<p><strong>Q: I still get an error, but the page is definitely ok.</strong></p>
<p>A: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy by the webmaster running the website you are checking. Look at
the <tt class="docutils literal"><span class="pre">/robots.txt</span></tt> file which follows the <a class="reference external" href="http://www.robotstxt.org/wc/norobots-rfc.html">robots.txt exclusion standard</a>.</p>
<p><strong>Q: How can I tell LinkChecker which proxy to use?</strong></p>
<p>A: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy, ftp_proxy environment
variables to a URL that identifies the proxy server before starting
LinkChecker. For example</p>
<div class="highlight-python"><pre>$ http_proxy="http://www.someproxy.com:3128"
$ export http_proxy</pre>
</div>
<p><strong>Q: The link &#8220;mailto:john&#64;company.com?subject=Hello John&#8221; is reported
as an error.</strong></p>
<p>A: You have to quote special characters (e.g. spaces) in the subject field.
The correct link should be &#8220;mailto:...?subject=Hello%20John&#8221;.
Unfortunately browsers like IE and Netscape do not enforce this.</p>
<p><strong>Q: Does LinkChecker have JavaScript support?</strong></p>
<p>A: No, it never will. If your page is not working without JS, it is
better checked with a browser testing tool like <a class="reference external" href="http://seleniumhq.org/">Selenium</a>.</p>
<p><strong>Q: Is LinkChecker's cookie feature insecure?</strong></p>
<p>A: If a cookie file is specified, the information will be sent
to the specified hosts.
The following restrictions apply for LinkChecker cookies:</p>
<ul class="simple">
<li>Cookies will only be sent to the originating server.</li>
<li>Cookies are only stored in memory. After LinkChecker finishes, they
are lost.</li>
<li>The cookie feature is disabled by default.</li>
</ul>
<p><strong>Q: I see LinkChecker gets a /robots.txt file for every site it
checks. What is that about?</strong></p>
<p>A: LinkChecker follows the <a class="reference external" href="http://www.robotstxt.org/wc/norobots-rfc.html">robots.txt exclusion standard</a>. To avoid
misuse of LinkChecker, you cannot turn this feature off.
See the <a class="reference external" href="http://www.robotstxt.org/wc/robots.html">Web Robot pages</a> and the <a class="reference external" href="http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt">Spidering report</a> for more info.</p>
<p><strong>Q: How do I print unreachable/dead documents of my website with
LinkChecker?</strong></p>
<p>A: No can do. This would require file system access to your web
repository and access to your web server configuration.</p>
<p><strong>Q: How do I check HTML/XML/CSS syntax with LinkChecker?</strong></p>
<p>A: Use the <tt class="docutils literal"><span class="pre">--check-html</span></tt> and <tt class="docutils literal"><span class="pre">--check-css</span></tt> options.</p>
</div>
</div>
</div>
</div>
<div class="clearer"></div>
</div>
directory, not just a subset like <code>index.htm*</code>.</p>
<div class="footer">
&copy; Copyright 2009, Bastian Kleineidam.
</div>

doc/html/index.txt Normal file

@@ -0,0 +1,146 @@
Documentation
=============
Basic usage
-----------
To check a URL like ``http://www.myhomepage.org/`` it is enough to
execute ``linkchecker http://www.myhomepage.org/``. This will check the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are also checked for validity.
Performed checks
----------------
All URLs have to pass a preliminary syntax test. Minor quoting
mistakes will issue a warning; all other invalid syntax issues
are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.
- HTTP links (``http:``, ``https:``)
After connecting to the given HTTP server the given path
or query is requested. All redirections are followed, and
if user/password is given it will be used as authorization
when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.
- Local files (``file:``)
A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device files,
unreadable or non-existing files, are errors.
File contents are checked for recursion.
- Mail links (``mailto:``)
A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list will fail.
For each mail address we check the following things:
1) Check the syntax of the address, both before and
after the @ sign.
2) Look up the MX DNS records. If no MX record is found,
print an error.
3) Check if one of the mail hosts accepts an SMTP connection.
Check hosts with higher priority first.
If no host accepts SMTP, we print a warning.
4) Try to verify the address with the VRFY command. If we get
an answer, print the verified address as an info message.
- FTP links (``ftp:``)
For FTP links we do:
1) connect to the specified host
2) try to log in with the given user and password. The default
user is ``anonymous``, the default password is ``anonymous@``.
3) try to change to the given directory
4) list the file with the NLST command
- Telnet links (``telnet:``)
We try to connect and, if user/password are given, log in to the
given telnet server.
- NNTP links (``news:``, ``snews:``, ``nntp:``)
We try to connect to the given NNTP server. If a news group or
article is specified, try to request it from the server.
- Ignored links (``javascript:``, etc.)
An ignored link will only print a warning. No further checking
will be made.
Here is a complete list of recognized but ignored links. The
most prominent of these are JavaScript links.
- ``acap:`` (application configuration access protocol)
- ``afs:`` (Andrew File System global file names)
- ``chrome:`` (Mozilla specific)
- ``cid:`` (content identifier)
- ``clsid:`` (Microsoft specific)
- ``data:`` (data)
- ``dav:`` (dav)
- ``fax:`` (fax)
- ``find:`` (Mozilla specific)
- ``gopher:`` (Gopher)
- ``imap:`` (internet message access protocol)
- ``isbn:`` (ISBN (int. book numbers))
- ``javascript:`` (JavaScript)
- ``ldap:`` (Lightweight Directory Access Protocol)
- ``mailserver:`` (Access to data available from mail servers)
- ``mid:`` (message identifier)
- ``mms:`` (multimedia stream)
- ``modem:`` (modem)
- ``nfs:`` (network file system protocol)
- ``opaquelocktoken:`` (opaquelocktoken)
- ``pop:`` (Post Office Protocol v3)
- ``prospero:`` (Prospero Directory Service)
- ``rsync:`` (rsync protocol)
- ``rtsp:`` (real time streaming protocol)
- ``service:`` (service location)
- ``shttp:`` (secure HTTP)
- ``sip:`` (session initiation protocol)
- ``tel:`` (telephone)
- ``tip:`` (Transaction Internet Protocol)
- ``tn3270:`` (Interactive 3270 emulation sessions)
- ``vemmi:`` (versatile multimedia interface)
- ``wais:`` (Wide Area Information Servers)
- ``z39.50r:`` (Z39.50 Retrieval)
- ``z39.50s:`` (Z39.50 Session)
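The HTTP rules above (redirections are followed, permanently moved pages
issue a warning, any final status outside 2xx is an error) boil down to a
small classifier. The following is only an illustrative sketch;
``classify_http_result`` and its return values are made up for this
example and are not LinkChecker's internal API:

```python
def classify_http_result(final_status, permanently_moved=False):
    # Any final HTTP status outside the 2xx range is an error.
    if not 200 <= final_status < 300:
        return "error"
    # A permanently moved page (a 301 seen while following redirects)
    # is still reachable, but issues a warning.
    if permanently_moved:
        return "warning"
    return "valid"
```

For example, a page that answers 200 after a 301 redirect would be
classified as a warning, while a final 404 is an error.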
Recursion
---------
Before descending recursively into a URL, it has to fulfill several
conditions. They are checked in this order:
1. A URL must be valid.
2. A URL must be parseable. This currently includes HTML files,
Opera bookmarks files, and directories. If a file type cannot
be determined (for example it does not have a common HTML file
extension, and the content does not look like HTML), it is assumed
to be non-parseable.
3. The URL content must be retrievable. This is usually the case
except for example mailto: or unknown URL types.
4. The maximum recursion level must not be exceeded. It is configured
with the ``--recursion-level`` option and is unlimited by default.
5. It must not match the ignored URL list. This is controlled with
the ``--ignore-url`` option.
6. The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by searching for a
"nofollow" directive in the HTML header data.
Note that the directory recursion reads all files in that
directory, not just a subset like ``index.htm*``.
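The six conditions can be read as a chain of guards that all have to
pass. The sketch below is hypothetical glue code, not LinkChecker's
implementation; the boolean arguments stand in for the real syntax,
content-type and robots.txt checks described above:

```python
import re

def should_recurse(url, level, max_level, ignore_patterns,
                   is_valid, is_parseable, is_retrievable, robots_allow):
    # 1. The URL must be valid.
    if not is_valid:
        return False
    # 2. The URL content must be parseable (HTML, bookmarks, directory).
    if not is_parseable:
        return False
    # 3. The content must be retrievable (not e.g. mailto:).
    if not is_retrievable:
        return False
    # 4. The maximum recursion level must not be exceeded
    #    (None mirrors the unlimited default of --recursion-level).
    if max_level is not None and level > max_level:
        return False
    # 5. The URL must not match any --ignore-url pattern.
    if any(re.search(pattern, url) for pattern in ignore_patterns):
        return False
    # 6. robots.txt and "nofollow" must allow recursion.
    return robots_allow
```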

doc/html/logo64x64.png Executable file → Normal file