<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.3.7: http://docutils.sourceforge.net/" />
<title>Documentation</title>
<meta content="3" name="navigation.order" />
<meta content="Documentation" name="navigation.name" />
<meta content="4" name="navigation.order" />
<meta content="FAQ" name="navigation.name" />
<link rel="stylesheet" href="lc.css" type="text/css" />
<link rel="shortcut icon" href="favicon.png" />
<link rel="stylesheet" href="navigation.css" type="text/css" />
<script type="text/javascript">
window.onload = function() {
    if (top.location != location) {
        top.location.href = document.location.href;
    }
}
</script>
</head>
<body>
<!-- bfknav -->
<div class="navigation">
<div class="navrow" style="padding: 0em 0em 0em 1em;">
<a href="./index.html">LinkChecker</a>
<a href="./install.html">Installation</a>
<a href="./upgrading.html">Upgrading</a>
<span>Documentation</span>
<a href="./other.html">Other</a>
<a href="./robots.html">Robots</a>
</div>
</div>
<!-- /bfknav -->
<div class="document" id="documentation">
<h1 class="title">Documentation</h1>
<div class="contents topic" id="contents">
<p class="topic-title first"><a name="contents">Contents</a></p>
<ul class="simple">
<li><a class="reference" href="#basic-usage" id="id2" name="id2">Basic usage</a></li>
<li><a class="reference" href="#performed-checks" id="id3" name="id3">Performed checks</a></li>
<li><a class="reference" href="#recursion" id="id4" name="id4">Recursion</a></li>
<li><a class="reference" href="#frequently-asked-questions" id="id5" name="id5">Frequently asked questions</a></li>
</ul>
</div>
<div class="section" id="basic-usage">
<h1><a class="toc-backref" href="#id2" name="basic-usage">Basic usage</a></h1>
<p>To check a URL like <tt class="docutils literal"><span class="pre">http://www.myhomepage.org/</span></tt> it is enough to
execute <tt class="docutils literal"><span class="pre">linkchecker</span> <span class="pre">http://www.myhomepage.org/</span></tt>. This checks the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are also checked for validity.</p>
<p>For more options, read the man page <tt class="docutils literal"><span class="pre">linkchecker(1)</span></tt> or execute
<tt class="docutils literal"><span class="pre">linkchecker</span> <span class="pre">-h</span></tt>.</p>
</div>
<div class="section" id="performed-checks">
<h1><a class="toc-backref" href="#id3" name="performed-checks">Performed checks</a></h1>
<p>All URLs have to pass a preliminary syntax test. Minor quoting
mistakes issue a warning; all other invalid syntax issues are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>
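<p>The preliminary syntax test described above can be sketched as follows. This is a simplified illustration in Python's standard library, not LinkChecker's actual implementation; the function name and the exact warning/error rules shown are assumptions.</p>

```python
from urllib.parse import urlsplit

def preliminary_check(url):
    """Rough sketch of a preliminary URL syntax test.

    Returns "ok", "warning" (minor quoting mistake) or "error".
    Illustration only, not LinkChecker's actual code.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        # completely unparseable URL: hard error
        return "error"
    if not parts.scheme:
        # a URL without a scheme cannot be checked
        return "error"
    # an unquoted space is a minor quoting mistake: warn, don't fail
    if " " in parts.path or " " in parts.query:
        return "warning"
    return "ok"

print(preliminary_check("http://www.myhomepage.org/"))   # ok
print(preliminary_check("http://example.com/a b.html"))  # warning
```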
<ul>
<li><p class="first">HTTP links (<tt class="docutils literal"><span class="pre">http:</span></tt>, <tt class="docutils literal"><span class="pre">https:</span></tt>)</p>
<p>After connecting to the given HTTP server, the given path
or query is requested. All redirections are followed, and
if a user name and password are given, they are used for
authorization when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.</p>
</li>
<li><p class="first">Local files (<tt class="docutils literal"><span class="pre">file:</span></tt>)</p>
<p>A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device files,
unreadable or non-existent files, are errors.</p>
<p>File contents are checked for recursion.</p>
</li>
<li><p class="first">Mail links (<tt class="docutils literal"><span class="pre">mailto:</span></tt>)</p>
<p>A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list fails.
For each mail address the following things are checked:</p>
<ol class="arabic simple">
<li>Look up the MX DNS records. If no MX record is found,
an error is printed.</li>
<li>Check if one of the mail hosts accepts an SMTP connection.
Hosts with higher priority are checked first.
If no host accepts SMTP, a warning is printed.</li>
<li>Try to verify the address with the VRFY command. If an
answer is received, the verified address is printed as an info.</li>
</ol>
</li>
<li><p class="first">FTP links (<tt class="docutils literal"><span class="pre">ftp:</span></tt>)</p>
<p>For FTP links we do:</p>
<ol class="arabic simple">
<li>connect to the specified host</li>
<li>try to log in with the given user and password. The default
user is <tt class="docutils literal"><span class="pre">anonymous</span></tt>, the default password is <tt class="docutils literal"><span class="pre">anonymous@</span></tt>.</li>
<li>try to change to the given directory</li>
<li>list the file with the NLST command</li>
</ol>
</li>
<li><p class="first">Gopher links (<tt class="docutils literal"><span class="pre">gopher:</span></tt>)</p>
<p>We try to send the given selector (or query) to the gopher server.</p>
</li>
<li><p class="first">Telnet links (<tt class="docutils literal"><span class="pre">telnet:</span></tt>)</p>
<p>We try to connect and, if user/password are given, log in to the
given telnet server.</p>
</li>
<li><p class="first">NNTP links (<tt class="docutils literal"><span class="pre">news:</span></tt>, <tt class="docutils literal"><span class="pre">snews:</span></tt>, <tt class="docutils literal"><span class="pre">nntp:</span></tt>)</p>
<p>We try to connect to the given NNTP server. If a news group or
article is specified, we try to request it from the server.</p>
</li>
<li><p class="first">Ignored links (<tt class="docutils literal"><span class="pre">javascript:</span></tt>, etc.)</p>
<p>An ignored link only prints a warning. No further checking
is performed.</p>
<p>Here is a complete list of recognized, but ignored links. The most
prominent of these are JavaScript links.</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">acap:</span></tt> (application configuration access protocol)</li>
<li><tt class="docutils literal"><span class="pre">afs:</span></tt> (Andrew File System global file names)</li>
<li><tt class="docutils literal"><span class="pre">chrome:</span></tt> (Mozilla specific)</li>
<li><tt class="docutils literal"><span class="pre">cid:</span></tt> (content identifier)</li>
<li><tt class="docutils literal"><span class="pre">clsid:</span></tt> (Microsoft specific)</li>
<li><tt class="docutils literal"><span class="pre">data:</span></tt> (data)</li>
<li><tt class="docutils literal"><span class="pre">dav:</span></tt> (dav)</li>
<li><tt class="docutils literal"><span class="pre">fax:</span></tt> (fax)</li>
<li><tt class="docutils literal"><span class="pre">find:</span></tt> (Mozilla specific)</li>
<li><tt class="docutils literal"><span class="pre">imap:</span></tt> (internet message access protocol)</li>
<li><tt class="docutils literal"><span class="pre">isbn:</span></tt> (International Standard Book Number)</li>
<li><tt class="docutils literal"><span class="pre">javascript:</span></tt> (JavaScript)</li>
<li><tt class="docutils literal"><span class="pre">ldap:</span></tt> (Lightweight Directory Access Protocol)</li>
<li><tt class="docutils literal"><span class="pre">mailserver:</span></tt> (Access to data available from mail servers)</li>
<li><tt class="docutils literal"><span class="pre">mid:</span></tt> (message identifier)</li>
<li><tt class="docutils literal"><span class="pre">mms:</span></tt> (multimedia stream)</li>
<li><tt class="docutils literal"><span class="pre">modem:</span></tt> (modem)</li>
<li><tt class="docutils literal"><span class="pre">nfs:</span></tt> (network file system protocol)</li>
<li><tt class="docutils literal"><span class="pre">opaquelocktoken:</span></tt> (opaquelocktoken)</li>
<li><tt class="docutils literal"><span class="pre">pop:</span></tt> (Post Office Protocol v3)</li>
<li><tt class="docutils literal"><span class="pre">prospero:</span></tt> (Prospero Directory Service)</li>
<li><tt class="docutils literal"><span class="pre">rsync:</span></tt> (rsync protocol)</li>
<li><tt class="docutils literal"><span class="pre">rtsp:</span></tt> (real time streaming protocol)</li>
<li><tt class="docutils literal"><span class="pre">service:</span></tt> (service location)</li>
<li><tt class="docutils literal"><span class="pre">shttp:</span></tt> (secure HTTP)</li>
<li><tt class="docutils literal"><span class="pre">sip:</span></tt> (session initiation protocol)</li>
<li><tt class="docutils literal"><span class="pre">tel:</span></tt> (telephone)</li>
<li><tt class="docutils literal"><span class="pre">tip:</span></tt> (Transaction Internet Protocol)</li>
<li><tt class="docutils literal"><span class="pre">tn3270:</span></tt> (Interactive 3270 emulation sessions)</li>
<li><tt class="docutils literal"><span class="pre">vemmi:</span></tt> (versatile multimedia interface)</li>
<li><tt class="docutils literal"><span class="pre">wais:</span></tt> (Wide Area Information Servers)</li>
<li><tt class="docutils literal"><span class="pre">z39.50r:</span></tt> (Z39.50 Retrieval)</li>
<li><tt class="docutils literal"><span class="pre">z39.50s:</span></tt> (Z39.50 Session)</li>
</ul>
</li>
</ul>
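<p>The four FTP steps listed above, including the documented anonymous login defaults, can be sketched with Python's standard <tt class="docutils literal"><span class="pre">ftplib</span></tt> and <tt class="docutils literal"><span class="pre">urllib.parse</span></tt> modules. This is an illustration only, not LinkChecker's actual code; the helper names are made up.</p>

```python
from urllib.parse import urlsplit, unquote
from ftplib import FTP  # used by ftp_check below

def ftp_login_params(url):
    """Derive FTP host, user, password and path from an ftp: URL.

    Falls back to the documented defaults: user "anonymous",
    password "anonymous@".  Illustration only.
    """
    parts = urlsplit(url)
    user = unquote(parts.username) if parts.username else "anonymous"
    password = unquote(parts.password) if parts.password else "anonymous@"
    return parts.hostname, user, password, parts.path

def ftp_check(url):
    """Sketch of the four documented steps: connect, log in, cwd, NLST."""
    host, user, password, path = ftp_login_params(url)
    ftp = FTP(host)               # 1. connect to the specified host
    ftp.login(user, password)     # 2. log in (anonymous by default)
    directory, _, filename = path.rpartition("/")
    if directory:
        ftp.cwd(directory)        # 3. change to the given directory
    listing = ftp.nlst(filename)  # 4. list the file with NLST
    ftp.quit()
    return listing
```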
</div>
<div class="section" id="recursion">
<h1><a class="toc-backref" href="#id4" name="recursion">Recursion</a></h1>
<p>Recursion occurs on HTML files, Opera bookmark files and directories.
Note that the directory recursion reads all files in that
directory, not just a subset like <tt class="docutils literal"><span class="pre">index.htm*</span></tt>.</p>
</div>
<div class="section" id="frequently-asked-questions">
<h1><a class="toc-backref" href="#id5" name="frequently-asked-questions">Frequently asked questions</a></h1>
<p><strong>Q: LinkChecker produced an error, but my web page is ok with
Netscape/IE/Opera/...
Is this a bug in LinkChecker?</strong></p>
<p>A: Please check your web pages first. Are they really ok? Use
a <a class="reference" href="http://fte.sourceforge.net/">syntax highlighting editor</a>. Use <a class="reference" href="http://tidy.sourceforge.net/">HTML Tidy</a>.
Check whether you are using a proxy that produces the error.</p>
<p><strong>Q: I still get an error, but the page is definitely ok.</strong></p>
<p>A: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy of the webmaster running the website you are checking.
A website might even send robots different web pages than it
sends to normal browsers.</p>
<p><strong>Q: How can I tell LinkChecker which proxy to use?</strong></p>
<p>A: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy
environment variables to a URL that identifies the proxy server before
starting LinkChecker. For example:</p>
<pre class="literal-block">
$ http_proxy="http://www.someproxy.com:3128"
$ export http_proxy
</pre>
<p>In a Macintosh environment, LinkChecker will retrieve proxy information
from Internet Config.</p>
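<p>To verify that such an environment variable is actually visible to a Python process (LinkChecker is written in Python), the standard library's <tt class="docutils literal"><span class="pre">urllib.request.getproxies()</span></tt> can be used. This is a diagnostic aid shown here for illustration, not a part of LinkChecker itself.</p>

```python
import os
from urllib.request import getproxies

# simulate the shell setup from above
os.environ["http_proxy"] = "http://www.someproxy.com:3128"

# getproxies() reads the *_proxy environment variables and returns
# a mapping of scheme -> proxy URL
proxies = getproxies()
print(proxies.get("http"))  # http://www.someproxy.com:3128
```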
<p><strong>Q: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.</strong></p>
<p>A: You have to quote special characters (e.g. spaces) in the subject field.
The correct link should be "mailto:...?subject=Hello%20John".
Unfortunately, browsers like IE and Netscape do not enforce this.</p>
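<p>The required quoting can be produced with <tt class="docutils literal"><span class="pre">urllib.parse.quote</span></tt> from the Python standard library; the snippet below is just an illustration of the encoding rule.</p>

```python
from urllib.parse import quote

subject = "Hello John"
# percent-encode the subject; quote() leaves alphanumerics alone and
# turns the space into %20
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)  # mailto:john@company.com?subject=Hello%20John
```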
<p><strong>Q: Does LinkChecker have JavaScript support?</strong></p>
<p>A: No, and it never will. If your page does not work without JavaScript,
then your web design is broken.
Use PHP or Zope or ASP for dynamic content, and use JavaScript only as
an add-on for your web pages.</p>
<p><strong>Q: I don't get this --extern/--intern stuff.</strong></p>
<p>A: When it comes to checking, there are three types of URLs. Note
that local files are also represented as URLs (i.e. <a class="reference" href="file://">file://</a>), so
local files can be external URLs.</p>
<ol class="arabic simple">
<li>strict external URLs:
We only do syntax checking. Internal URLs are never strict.</li>
<li>external URLs:
Like 1), but we additionally check if they are valid by connect()ing
to them.</li>
<li>internal URLs:
Like 2), but we additionally check if they are HTML pages and, if so,
we descend recursively into the link and check all the links in the
HTML content.
The --recursion-level option restricts the number of such recursive
descents.</li>
</ol>
<p>LinkChecker provides four options that determine which of those
three categories a URL falls into: --intern, --extern, --extern-strict-all and
--denyallow.
By default all URLs are internal. With --extern you specify which URLs
are external; with --intern you specify which URLs are internal.
Now imagine you have both --extern and --intern. What happens
when a URL matches both patterns? Or when it matches none? In this
situation the --denyallow option specifies the order in which we match
the URL. By default it is internal/external; with --denyallow the order is
external/internal. Either way, the first match counts, and if none matches,
the last checked category is the category for the URL.
Finally, with --extern-strict-all all external URLs are strict.</p>
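<p>The first-match-wins ordering described above can be modeled like this. This is a simplified sketch with made-up function and variable names, not LinkChecker's actual matching code.</p>

```python
import re

def categorize(url, intern_patterns, extern_patterns, denyallow=False):
    """First-match-wins categorization of a URL.

    By default internal patterns are tried first, then external ones;
    with denyallow the order is reversed.  If nothing matches, the URL
    gets the last checked category.  Illustration only.
    """
    if denyallow:
        order = [("extern", extern_patterns), ("intern", intern_patterns)]
    else:
        order = [("intern", intern_patterns), ("extern", extern_patterns)]
    category = None
    for name, patterns in order:
        category = name  # remember the last checked category
        if any(re.search(p, url) for p in patterns):
            return name  # the first match counts
    return category

# default order: internal patterns win on a double match
print(categorize("http://www.mycompany.com/", ["mycompany"], ["mycompany"]))
# with --denyallow the external patterns are tried first
print(categorize("http://www.mycompany.com/", ["mycompany"], ["mycompany"], denyallow=True))
```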
<p>Oh, and just to boggle your mind: you can have more than one external
regular expression in a config file and for each of those expressions
you can specify if those matched external URLs should be strict or not.</p>
<p>An example: we don't want to check mailto URLs. Then the option is
-i'!^mailto:'. The '!' negates an expression. With --extern-strict-all,
we don't even connect to any mail hosts.</p>
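<p>The '!' negation can be modeled as follows; the helper name is made up and LinkChecker's real pattern handling may differ in detail.</p>

```python
import re

def matches(pattern, url):
    """Match a URL against a pattern, honoring a leading '!' negation."""
    if pattern.startswith("!"):
        # a negated pattern matches exactly when the inner pattern does not
        return re.search(pattern[1:], url) is None
    return re.search(pattern, url) is not None

print(matches("!^mailto:", "http://www.myhomepage.org/"))  # True: not a mailto URL
print(matches("!^mailto:", "mailto:john@company.com"))     # False
```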
<p>Another example: we check our site www.mycompany.com, don't recurse
into external links pointing outside our site, and want to ignore links
to hollowood.com and hullabulla.com completely.
This can only be done with a configuration entry like</p>
<pre class="literal-block">
[filtering]
extern1=hollowood.com 1
extern2=hullabulla.com 1
# the 1 means strict external, i.e. don't even connect
</pre>
<p>and the command
<tt class="docutils literal"><span class="pre">linkchecker</span> <span class="pre">--intern=www.mycompany.com</span> <span class="pre">www.mycompany.com</span></tt></p>
<p><strong>Q: Is LinkChecker's cookie feature insecure?</strong></p>
<p>A: Cookies cannot store more information than is in the HTTP request itself,
so you are not giving away any additional system information.
After being stored, however, cookies are sent back to the server on request.
Not to every server, but only to the one the cookie originated from!
This can be used to "track" subsequent requests to that server,
and this is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost, so the tracking is restricted to the checking time.
The cookie feature is disabled by default.</p>
<p><strong>Q: I want to have my own logging class. How can I use it in LinkChecker?</strong></p>
<p>A: Currently, only a Python API lets you define new logging classes.
Define your own logging class as a subclass of StandardLogger or any other
logging class in the log module.
Then call the logger_add method of the configuration object to register
your new logger class, and append a new logger instance to the
fileoutput list:</p>
<pre class="literal-block">
import linkcheck.configuration
import MyLogger  # the module containing your custom logger class

log_format = 'mylog'
log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
cfg = linkcheck.configuration.Configuration()
# register the new logger class under its format name
cfg.logger_add(log_format, MyLogger.MyLogger)
# instantiate the logger and add it to the file output list
cfg['fileoutput'].append(cfg.logger_new(log_format, log_args))
</pre>
<p><strong>Q: LinkChecker does not ignore anchor references on caching.</strong></p>
<p><strong>Q: Some links with anchors are getting checked twice.</strong></p>
<p>A: This is not a bug.
It is commonly believed that if a URL <tt class="docutils literal"><span class="pre">ABC#anchor1</span></tt> works, then
<tt class="docutils literal"><span class="pre">ABC#anchor2</span></tt> works too. That is not specified anywhere, and I have seen
server-side scripts that fail on some anchors and not on others.
This is the reason for always checking URLs with different anchors.
If you really want to disable this, use the <tt class="docutils literal"><span class="pre">--no-anchor-caching</span></tt>
option.</p>
<p><strong>Q: I see LinkChecker gets a /robots.txt file for every site it
checks. What is that about?</strong></p>
<p>A: LinkChecker follows the robots.txt exclusion standard. To avoid
misuse of LinkChecker, you cannot turn this feature off.
See the <a class="reference" href="http://www.robotstxt.org/wc/robots.html">Web Robot pages</a> and the <a class="reference" href="http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt">Spidering report</a> for more info.</p>
<p><strong>Q: Ctrl-C does not stop LinkChecker immediately. Why is that so?</strong></p>
<p>A: The Python interpreter has to wait for all threads to finish, and
this means waiting for all open sockets to close. The default timeout
for sockets is 30 seconds, hence the delay.
You can change the default socket timeout with the --timeout option.</p>
<p><strong>Q: How do I print unreachable/dead documents of my website with
LinkChecker?</strong></p>
<p>A: No can do. This would require file system access to your web
repository and access to your web server configuration.</p>
<p>You can instead store the LinkChecker results in a database
and look for missing files.</p>
<p><strong>Q: How do I check HTML/XML syntax with LinkChecker?</strong></p>
<p>A: No can do. Use the <a class="reference" href="http://tidy.sourceforge.net/">HTML Tidy</a> program.</p>
</div>
</div>
<hr class="docutils footer" />
<div class="footer">
Generated on: 2005-01-11 11:18 UTC.
</div>
</body>
</html>