<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.3.7: http://docutils.sourceforge.net/" />
<title>Documentation</title>
<meta content="3" name="navigation.order" />
<meta content="Documentation" name="navigation.name" />
<meta content="4" name="navigation.order" />
<meta content="FAQ" name="navigation.name" />
<link rel="stylesheet" href="lc.css" type="text/css" />
<link rel="shortcut icon" href="favicon.png" />
<link rel="stylesheet" href="navigation.css" type="text/css" />
<script type="text/javascript">
window.onload = function() {
    if (top.location != location) {
        top.location.href = document.location.href;
    }
}
</script>
</head>
<body>
<!-- bfknav -->
<div class="navigation">
<div class="navrow" style="padding: 0em 0em 0em 1em;">
<a href="./index.html">LinkChecker</a>
<a href="./install.html">Installation</a>
<a href="./upgrading.html">Upgrading</a>
<span>Documentation</span>
<a href="./other.html">Other</a>
<a href="./robots.html">Robots</a>
</div>
</div>
<!-- /bfknav -->
<div class="document" id="documentation">
<h1 class="title">Documentation</h1>
<div class="contents topic" id="contents">
<p class="topic-title first"><a name="contents">Contents</a></p>
<ul class="simple">
<li><a class="reference" href="#basic-usage" id="id2" name="id2">Basic usage</a></li>
<li><a class="reference" href="#performed-checks" id="id3" name="id3">Performed checks</a></li>
<li><a class="reference" href="#recursion" id="id4" name="id4">Recursion</a></li>
<li><a class="reference" href="#frequently-asked-questions" id="id5" name="id5">Frequently asked questions</a></li>
</ul>
</div>
<div class="section" id="basic-usage">
<h1><a class="toc-backref" href="#id2" name="basic-usage">Basic usage</a></h1>
<p>To check a URL like <tt class="docutils literal"><span class="pre">http://www.myhomepage.org/</span></tt> it is enough to
execute <tt class="docutils literal"><span class="pre">linkchecker</span> <span class="pre">http://www.myhomepage.org/</span></tt>. This checks the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are also checked for validity.</p>
<p>For more options, read the man page <tt class="docutils literal"><span class="pre">linkchecker(1)</span></tt> or execute
<tt class="docutils literal"><span class="pre">linkchecker</span> <span class="pre">-h</span></tt>.</p>
</div>
<div class="section" id="performed-checks">
<h1><a class="toc-backref" href="#id3" name="performed-checks">Performed checks</a></h1>
<p>All URLs have to pass a preliminary syntax test. Minor quoting
mistakes issue a warning; all other invalid syntax issues are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>
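<p>The preliminary syntax test described above can be sketched as follows. This is a simplified illustration in Python's standard library, not LinkChecker's actual implementation; the function name and the exact warning/error rules shown are assumptions.</p>

```python
from urllib.parse import urlsplit

def preliminary_check(url):
    """Rough sketch of a preliminary URL syntax test.

    Returns "ok", "warning" (minor quoting mistake) or "error".
    Illustration only, not LinkChecker's actual code.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        # completely unparseable URL: hard error
        return "error"
    if not parts.scheme:
        # a URL without a scheme cannot be checked
        return "error"
    # an unquoted space is a minor quoting mistake: warn, don't fail
    if " " in parts.path or " " in parts.query:
        return "warning"
    return "ok"

print(preliminary_check("http://www.myhomepage.org/"))   # ok
print(preliminary_check("http://example.com/a b.html"))  # warning
```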
<ul>
<li><p class="first">HTTP links (<tt class="docutils literal"><span class="pre">http:</span></tt>, <tt class="docutils literal"><span class="pre">https:</span></tt>)</p>
<p>After connecting to the given HTTP server, the given path
or query is requested. All redirections are followed, and
if a user name and password are given, they are used for
authorization when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.</p>
</li>
<li><p class="first">Local files (<tt class="docutils literal"><span class="pre">file:</span></tt>)</p>
<p>A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device files,
unreadable or non-existent files, are errors.</p>
<p>File contents are checked for recursion.</p>
</li>
<li><p class="first">Mail links (<tt class="docutils literal"><span class="pre">mailto:</span></tt>)</p>
<p>A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list fails.
For each mail address the following things are checked:</p>
<ol class="arabic simple">
<li>Look up the MX DNS records. If no MX record is found,
an error is printed.</li>
<li>Check if one of the mail hosts accepts an SMTP connection.
Hosts with higher priority are checked first.
If no host accepts SMTP, a warning is printed.</li>
<li>Try to verify the address with the VRFY command. If an
answer is received, the verified address is printed as an info.</li>
</ol>
</li>
<li><p class="first">FTP links (<tt class="docutils literal"><span class="pre">ftp:</span></tt>)</p>
<p>For FTP links we do:</p>
<ol class="arabic simple">
<li>connect to the specified host</li>
<li>try to log in with the given user and password. The default
user is <tt class="docutils literal"><span class="pre">anonymous</span></tt>, the default password is <tt class="docutils literal"><span class="pre">anonymous@</span></tt>.</li>
<li>try to change to the given directory</li>
<li>list the file with the NLST command</li>
</ol>
</li>
<li><p class="first">Gopher links (<tt class="docutils literal"><span class="pre">gopher:</span></tt>)</p>
<p>We try to send the given selector (or query) to the gopher server.</p>
</li>
<li><p class="first">Telnet links (<tt class="docutils literal"><span class="pre">telnet:</span></tt>)</p>
<p>We try to connect and, if user/password are given, log in to the
given telnet server.</p>
</li>
<li><p class="first">NNTP links (<tt class="docutils literal"><span class="pre">news:</span></tt>, <tt class="docutils literal"><span class="pre">snews:</span></tt>, <tt class="docutils literal"><span class="pre">nntp:</span></tt>)</p>
<p>We try to connect to the given NNTP server. If a news group or
article is specified, we try to request it from the server.</p>
</li>
<li><p class="first">Ignored links (<tt class="docutils literal"><span class="pre">javascript:</span></tt>, etc.)</p>
<p>An ignored link only prints a warning. No further checking
is performed.</p>
<p>Here is a complete list of recognized, but ignored links. The most
prominent of these are JavaScript links.</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">acap:</span></tt> (application configuration access protocol)</li>
<li><tt class="docutils literal"><span class="pre">afs:</span></tt> (Andrew File System global file names)</li>
<li><tt class="docutils literal"><span class="pre">chrome:</span></tt> (Mozilla specific)</li>
<li><tt class="docutils literal"><span class="pre">cid:</span></tt> (content identifier)</li>
<li><tt class="docutils literal"><span class="pre">clsid:</span></tt> (Microsoft specific)</li>
<li><tt class="docutils literal"><span class="pre">data:</span></tt> (data)</li>
<li><tt class="docutils literal"><span class="pre">dav:</span></tt> (dav)</li>
<li><tt class="docutils literal"><span class="pre">fax:</span></tt> (fax)</li>
<li><tt class="docutils literal"><span class="pre">find:</span></tt> (Mozilla specific)</li>
<li><tt class="docutils literal"><span class="pre">imap:</span></tt> (internet message access protocol)</li>
<li><tt class="docutils literal"><span class="pre">isbn:</span></tt> (International Standard Book Number)</li>
<li><tt class="docutils literal"><span class="pre">javascript:</span></tt> (JavaScript)</li>
<li><tt class="docutils literal"><span class="pre">ldap:</span></tt> (Lightweight Directory Access Protocol)</li>
<li><tt class="docutils literal"><span class="pre">mailserver:</span></tt> (Access to data available from mail servers)</li>
<li><tt class="docutils literal"><span class="pre">mid:</span></tt> (message identifier)</li>
<li><tt class="docutils literal"><span class="pre">mms:</span></tt> (multimedia stream)</li>
<li><tt class="docutils literal"><span class="pre">modem:</span></tt> (modem)</li>
<li><tt class="docutils literal"><span class="pre">nfs:</span></tt> (network file system protocol)</li>
<li><tt class="docutils literal"><span class="pre">opaquelocktoken:</span></tt> (opaquelocktoken)</li>
<li><tt class="docutils literal"><span class="pre">pop:</span></tt> (Post Office Protocol v3)</li>
<li><tt class="docutils literal"><span class="pre">prospero:</span></tt> (Prospero Directory Service)</li>
<li><tt class="docutils literal"><span class="pre">rsync:</span></tt> (rsync protocol)</li>
<li><tt class="docutils literal"><span class="pre">rtsp:</span></tt> (real time streaming protocol)</li>
<li><tt class="docutils literal"><span class="pre">service:</span></tt> (service location)</li>
<li><tt class="docutils literal"><span class="pre">shttp:</span></tt> (secure HTTP)</li>
<li><tt class="docutils literal"><span class="pre">sip:</span></tt> (session initiation protocol)</li>
<li><tt class="docutils literal"><span class="pre">tel:</span></tt> (telephone)</li>
<li><tt class="docutils literal"><span class="pre">tip:</span></tt> (Transaction Internet Protocol)</li>
<li><tt class="docutils literal"><span class="pre">tn3270:</span></tt> (Interactive 3270 emulation sessions)</li>
<li><tt class="docutils literal"><span class="pre">vemmi:</span></tt> (versatile multimedia interface)</li>
<li><tt class="docutils literal"><span class="pre">wais:</span></tt> (Wide Area Information Servers)</li>
<li><tt class="docutils literal"><span class="pre">z39.50r:</span></tt> (Z39.50 Retrieval)</li>
<li><tt class="docutils literal"><span class="pre">z39.50s:</span></tt> (Z39.50 Session)</li>
</ul>
</li>
</ul>
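<p>The four FTP steps listed above, including the documented anonymous login defaults, can be sketched with Python's standard <tt class="docutils literal"><span class="pre">ftplib</span></tt> and <tt class="docutils literal"><span class="pre">urllib.parse</span></tt> modules. This is an illustration only, not LinkChecker's actual code; the helper names are made up.</p>

```python
from urllib.parse import urlsplit, unquote
from ftplib import FTP  # used by ftp_check below

def ftp_login_params(url):
    """Derive FTP host, user, password and path from an ftp: URL.

    Falls back to the documented defaults: user "anonymous",
    password "anonymous@".  Illustration only.
    """
    parts = urlsplit(url)
    user = unquote(parts.username) if parts.username else "anonymous"
    password = unquote(parts.password) if parts.password else "anonymous@"
    return parts.hostname, user, password, parts.path

def ftp_check(url):
    """Sketch of the four documented steps: connect, log in, cwd, NLST."""
    host, user, password, path = ftp_login_params(url)
    ftp = FTP(host)               # 1. connect to the specified host
    ftp.login(user, password)     # 2. log in (anonymous by default)
    directory, _, filename = path.rpartition("/")
    if directory:
        ftp.cwd(directory)        # 3. change to the given directory
    listing = ftp.nlst(filename)  # 4. list the file with NLST
    ftp.quit()
    return listing
```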
</div>
<div class="section" id="recursion">
<h1><a class="toc-backref" href="#id4" name="recursion">Recursion</a></h1>
<p>Recursion occurs on HTML files, Opera bookmark files and directories.
Note that the directory recursion reads all files in that
directory, not just a subset like <tt class="docutils literal"><span class="pre">index.htm*</span></tt>.</p>
</div>
<div class="section" id="frequently-asked-questions">
<h1><a class="toc-backref" href="#id5" name="frequently-asked-questions">Frequently asked questions</a></h1>
<p><strong>Q: LinkChecker produced an error, but my web page is ok with
Netscape/IE/Opera/...
Is this a bug in LinkChecker?</strong></p>
<p>A: Please check your web pages first. Are they really ok? Use
a <a class="reference" href="http://fte.sourceforge.net/">syntax highlighting editor</a>. Use <a class="reference" href="http://tidy.sourceforge.net/">HTML Tidy</a>.
Check whether you are using a proxy that produces the error.</p>
<p><strong>Q: I still get an error, but the page is definitely ok.</strong></p>
<p>A: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy of the webmaster running the website you are checking.
A website might even send robots different web pages than it
sends to normal browsers.</p>
<p><strong>Q: How can I tell LinkChecker which proxy to use?</strong></p>
<p>A: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy
environment variables to a URL that identifies the proxy server before
starting LinkChecker. For example:</p>
<pre class="literal-block">
$ http_proxy="http://www.someproxy.com:3128"
$ export http_proxy
</pre>
<p>In a Macintosh environment, LinkChecker will retrieve proxy information
from Internet Config.</p>
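<p>To verify that such an environment variable is actually visible to a Python process (LinkChecker is written in Python), the standard library's <tt class="docutils literal"><span class="pre">urllib.request.getproxies()</span></tt> can be used. This is a diagnostic aid shown here for illustration, not a part of LinkChecker itself.</p>

```python
import os
from urllib.request import getproxies

# simulate the shell setup from above
os.environ["http_proxy"] = "http://www.someproxy.com:3128"

# getproxies() reads the *_proxy environment variables and returns
# a mapping of scheme -> proxy URL
proxies = getproxies()
print(proxies.get("http"))  # http://www.someproxy.com:3128
```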
<p><strong>Q: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.</strong></p>
<p>A: You have to quote special characters (e.g. spaces) in the subject field.
The correct link should be "mailto:...?subject=Hello%20John".
Unfortunately, browsers like IE and Netscape do not enforce this.</p>
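<p>The required quoting can be produced with <tt class="docutils literal"><span class="pre">urllib.parse.quote</span></tt> from the Python standard library; the snippet below is just an illustration of the encoding rule.</p>

```python
from urllib.parse import quote

subject = "Hello John"
# percent-encode the subject; quote() leaves alphanumerics alone and
# turns the space into %20
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)  # mailto:john@company.com?subject=Hello%20John
```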
<p><strong>Q: Does LinkChecker have JavaScript support?</strong></p>
<p>A: No, and it never will. If your page does not work without JavaScript,
then your web design is broken.
Use PHP or Zope or ASP for dynamic content, and use JavaScript only as
an add-on for your web pages.</p>
<p><strong>Q: I don't get this --extern/--intern stuff.</strong></p>
<p>A: When it comes to checking, there are three types of URLs. Note
that local files are also represented as URLs (i.e. <a class="reference" href="file://">file://</a>), so
local files can be external URLs.</p>
<ol class="arabic simple">
<li>strict external URLs:
We only do syntax checking. Internal URLs are never strict.</li>
<li>external URLs:
Like 1), but we additionally check if they are valid by connect()ing
to them.</li>
<li>internal URLs:
Like 2), but we additionally check if they are HTML pages and, if so,
we descend recursively into the link and check all the links in the
HTML content.
The --recursion-level option restricts the number of such recursive
descents.</li>
</ol>
<p>LinkChecker provides four options that determine which of those
three categories a URL falls into: --intern, --extern, --extern-strict-all and
--denyallow.
By default all URLs are internal. With --extern you specify which URLs
are external; with --intern you specify which URLs are internal.
Now imagine you have both --extern and --intern. What happens
when a URL matches both patterns? Or when it matches none? In this
situation the --denyallow option specifies the order in which we match
the URL. By default it is internal/external; with --denyallow the order is
external/internal. Either way, the first match counts, and if none matches,
the last checked category is the category for the URL.
Finally, with --extern-strict-all all external URLs are strict.</p>
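<p>The first-match-wins ordering described above can be modeled like this. This is a simplified sketch with made-up function and variable names, not LinkChecker's actual matching code.</p>

```python
import re

def categorize(url, intern_patterns, extern_patterns, denyallow=False):
    """First-match-wins categorization of a URL.

    By default internal patterns are tried first, then external ones;
    with denyallow the order is reversed.  If nothing matches, the URL
    gets the last checked category.  Illustration only.
    """
    if denyallow:
        order = [("extern", extern_patterns), ("intern", intern_patterns)]
    else:
        order = [("intern", intern_patterns), ("extern", extern_patterns)]
    category = None
    for name, patterns in order:
        category = name  # remember the last checked category
        if any(re.search(p, url) for p in patterns):
            return name  # the first match counts
    return category

# default order: internal patterns win on a double match
print(categorize("http://www.mycompany.com/", ["mycompany"], ["mycompany"]))
# with --denyallow the external patterns are tried first
print(categorize("http://www.mycompany.com/", ["mycompany"], ["mycompany"], denyallow=True))
```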
<p>Oh, and just to boggle your mind: you can have more than one external
regular expression in a config file and for each of those expressions
you can specify if those matched external URLs should be strict or not.</p>
<p>An example: we don't want to check mailto URLs. Then the option is
-i'!^mailto:'. The '!' negates an expression. With --extern-strict-all,
we don't even connect to any mail hosts.</p>
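<p>The '!' negation can be modeled as follows; the helper name is made up and LinkChecker's real pattern handling may differ in detail.</p>

```python
import re

def matches(pattern, url):
    """Match a URL against a pattern, honoring a leading '!' negation."""
    if pattern.startswith("!"):
        # a negated pattern matches exactly when the inner pattern does not
        return re.search(pattern[1:], url) is None
    return re.search(pattern, url) is not None

print(matches("!^mailto:", "http://www.myhomepage.org/"))  # True: not a mailto URL
print(matches("!^mailto:", "mailto:john@company.com"))     # False
```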
<p>Another example: we check our site www.mycompany.com, don't recurse
into external links pointing outside our site, and want to ignore links
to hollowood.com and hullabulla.com completely.
This can only be done with a configuration entry like</p>
<pre class="literal-block">
[filtering]
extern1=hollowood.com 1
extern2=hullabulla.com 1
# the 1 means strict external, i.e. don't even connect
</pre>
<p>and the command
<tt class="docutils literal"><span class="pre">linkchecker</span> <span class="pre">--intern=www.mycompany.com</span> <span class="pre">www.mycompany.com</span></tt></p>
<p><strong>Q: Is LinkChecker's cookie feature insecure?</strong></p>
<p>A: Cookies cannot store more information than is in the HTTP request itself,
so you are not giving away any additional system information.
After being stored, however, cookies are sent back to the server on request.
Not to every server, but only to the one the cookie originated from!
This can be used to "track" subsequent requests to that server,
and this is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost, so the tracking is restricted to the checking time.
The cookie feature is disabled by default.</p>
<p><strong>Q: I want to have my own logging class. How can I use it in LinkChecker?</strong></p>
<p>A: Currently, only a Python API lets you define new logging classes.
Define your own logging class as a subclass of StandardLogger or any other
logging class in the log module.
Then call the logger_add method of the configuration object to register
your new logger class, and append a new logger instance to the
fileoutput list:</p>
<pre class="literal-block">
import linkcheck.configuration
import MyLogger  # the module containing your custom logger class

log_format = 'mylog'
log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
cfg = linkcheck.configuration.Configuration()
# register the new logger class under its format name
cfg.logger_add(log_format, MyLogger.MyLogger)
# instantiate the logger and add it to the file output list
cfg['fileoutput'].append(cfg.logger_new(log_format, log_args))
</pre>
<p><strong>Q: LinkChecker does not ignore anchor references on caching.</strong></p>
<p><strong>Q: Some links with anchors are getting checked twice.</strong></p>
<p>A: This is not a bug.
It is commonly believed that if a URL <tt class="docutils literal"><span class="pre">ABC#anchor1</span></tt> works, then
<tt class="docutils literal"><span class="pre">ABC#anchor2</span></tt> works too. That is not specified anywhere, and I have seen
server-side scripts that fail on some anchors and not on others.
This is the reason for always checking URLs with different anchors.
If you really want to disable this, use the <tt class="docutils literal"><span class="pre">--no-anchor-caching</span></tt>
option.</p>
<p><strong>Q: I see LinkChecker gets a /robots.txt file for every site it
checks. What is that about?</strong></p>
<p>A: LinkChecker follows the robots.txt exclusion standard. To avoid
misuse of LinkChecker, you cannot turn this feature off.
See the <a class="reference" href="http://www.robotstxt.org/wc/robots.html">Web Robot pages</a> and the <a class="reference" href="http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt">Spidering report</a> for more info.</p>
<p><strong>Q: Ctrl-C does not stop LinkChecker immediately. Why is that so?</strong></p>
<p>A: The Python interpreter has to wait for all threads to finish, and
this means waiting for all open sockets to close. The default timeout
for sockets is 30 seconds, hence the delay.
You can change the default socket timeout with the --timeout option.</p>
<p><strong>Q: How do I print unreachable/dead documents of my website with
LinkChecker?</strong></p>
<p>A: No can do. This would require file system access to your web
repository and access to your web server configuration.</p>
<p>You can instead store the LinkChecker results in a database
and look for missing files.</p>
<p><strong>Q: How do I check HTML/XML syntax with LinkChecker?</strong></p>
<p>A: No can do. Use the <a class="reference" href="http://tidy.sourceforge.net/">HTML Tidy</a> program.</p>
</div>
</div>
<hr class="docutils footer" />
<div class="footer">
Generated on: 2005-01-11 11:18 UTC.
</div>
</body>
</html>