mirror of
https://github.com/Hopiu/linkchecker.git
synced 2026-03-16 22:10:26 +00:00
Updated HTML documentation
This commit is contained in:
parent
a92e13826b
commit
fb091c377f
6 changed files with 280 additions and 160 deletions
|
|
@ -6,7 +6,10 @@ all: $(HELPFILES)
|
|||
clean:
|
||||
-rm -f *.qhc *.qch
|
||||
|
||||
%.qhc: %.qhcp lcdoc.qhp
|
||||
%.html: %.txt html.header html.footer
|
||||
(cat html.header; markdown2 $<; cat html.footer) > index.html
|
||||
|
||||
%.qhc: %.qhcp lcdoc.qhp index.html
|
||||
qcollectiongenerator $< -o $@
|
||||
|
||||
favicon.ico: favicon32x32.png favicon16x16.png
|
||||
|
|
|
|||
5
doc/html/html.footer
Normal file
5
doc/html/html.footer
Normal file
|
|
@ -0,0 +1,5 @@
|
|||
<div class="footer">
|
||||
© Copyright 2009, Bastian Kleineidam.
|
||||
</div>
|
||||
</body>
|
||||
</html>
|
||||
24
doc/html/html.header
Normal file
24
doc/html/html.header
Normal file
|
|
@ -0,0 +1,24 @@
|
|||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
|
||||
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
|
||||
|
||||
<html xmlns="http://www.w3.org/1999/xhtml">
|
||||
<head>
|
||||
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
|
||||
|
||||
<title>Check websites for broken links</title>
|
||||
<link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
|
||||
<link rel="stylesheet" href="pygments.css" type="text/css" />
|
||||
<style type="text/css">
|
||||
img { border: 0; }
|
||||
</style>
|
||||
|
||||
</head>
|
||||
<body>
|
||||
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
|
||||
<table border="0"><tr>
|
||||
<td><img
|
||||
src="logo64x64.png" border="0" alt="LinkChecker"/></td>
|
||||
<td><h1>LinkChecker</h1></td>
|
||||
</tr></table>
|
||||
</div>
|
||||
|
||||
|
|
@ -8,7 +8,6 @@
|
|||
<title>Check websites for broken links</title>
|
||||
<link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
|
||||
<link rel="stylesheet" href="pygments.css" type="text/css" />
|
||||
<link rel="top" title="LinkChecker" href="" />
|
||||
<style type="text/css">
|
||||
img { border: 0; }
|
||||
</style>
|
||||
|
|
@ -16,206 +15,149 @@ img { border: 0; }
|
|||
</head>
|
||||
<body>
|
||||
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
|
||||
|
||||
<table border="0"><tr>
|
||||
<td><a href=""><img
|
||||
src="logo64x64.png" border="0" alt="LinkChecker"/></a></td>
|
||||
<td><img
|
||||
src="logo64x64.png" border="0" alt="LinkChecker"/></td>
|
||||
<td><h1>LinkChecker</h1></td>
|
||||
</tr></table>
|
||||
</div>
|
||||
|
||||
<div class="document">
|
||||
<div class="documentwrapper">
|
||||
<div class="body">
|
||||
|
||||
<div class="section" id="check-websites-for-broken-links">
|
||||
<p>To check a URL like <tt class="docutils literal"><span class="pre">http://www.myhomepage.org/</span></tt> it is enough to
|
||||
execute <tt class="docutils literal"><span class="pre">linkchecker</span> <span class="pre">http://www.myhomepage.org/</span></tt>. This will check the
|
||||
<h1>Documentation</h1>
|
||||
|
||||
<h2>Basic usage</h2>
|
||||
|
||||
<p>To check a URL like <code>http://www.myhomepage.org/</code> it is enough to
|
||||
execute <code>linkchecker http://www.myhomepage.org/</code>. This will check the
|
||||
complete domain of www.myhomepage.org recursively. All links pointing
|
||||
outside of the domain are also checked for validity.</p>
|
||||
</div>
|
||||
<div class="section" id="performed-checks">
|
||||
|
||||
<h2>Performed checks</h2>
|
||||
|
||||
<p>All URLs have to pass a preliminary syntax test. Minor quoting
|
||||
mistakes will issue a warning, all other invalid syntax issues
|
||||
are errors.
|
||||
After the syntax check passes, the URL is queued for connection
|
||||
checking. All connection check types are described below.</p>
|
||||
|
||||
<ul>
|
||||
<li><p class="first">HTTP links (<tt class="docutils literal"><span class="pre">http:</span></tt>, <tt class="docutils literal"><span class="pre">https:</span></tt>)</p>
|
||||
<li><p>HTTP links (<code>http:</code>, <code>https:</code>)</p>
|
||||
|
||||
<p>After connecting to the given HTTP server the given path
|
||||
or query is requested. All redirections are followed, and
|
||||
if user/password is given it will be used as authorization
|
||||
when necessary.
|
||||
Permanently moved pages issue a warning.
|
||||
All final HTTP status codes other than 2xx are errors.</p>
|
||||
</li>
|
||||
<li><p class="first">Local files (<tt class="docutils literal"><span class="pre">file:</span></tt>)</p>
|
||||
All final HTTP status codes other than 2xx are errors.</p></li>
|
||||
<li><p>Local files (<code>file:</code>)</p>
|
||||
|
||||
<p>A regular, readable file that can be opened is valid. A readable
|
||||
directory is also valid. All other files, for example device files,
|
||||
unreadable or non-existing files are errors.</p>
|
||||
<p>File contents are checked for recursion.</p>
|
||||
</li>
|
||||
<li><p class="first">Mail links (<tt class="docutils literal"><span class="pre">mailto:</span></tt>)</p>
|
||||
|
||||
<p>File contents are checked for recursion.</p></li>
|
||||
<li><p>Mail links (<code>mailto:</code>)</p>
|
||||
|
||||
<p>A mailto: link eventually resolves to a list of email addresses.
|
||||
If one address fails, the whole list will fail.
|
||||
For each mail address we check the following things:</p>
|
||||
<ol class="arabic simple">
|
||||
<li>Check the adress syntax, both of the part before and after
|
||||
the @ sign.</li>
|
||||
<li>Look up the MX DNS records. If we found no MX record,
|
||||
print an error.</li>
|
||||
<li>Check if one of the mail hosts accept an SMTP connection.
|
||||
Check hosts with higher priority first.
|
||||
If no host accepts SMTP, we print a warning.</li>
|
||||
<li>Try to verify the address with the VRFY command. If we got
|
||||
an answer, print the verified address as an info.</li>
|
||||
</ol>
|
||||
</li>
|
||||
<li><p class="first">FTP links (<tt class="docutils literal"><span class="pre">ftp:</span></tt>)</p>
|
||||
|
||||
<p>1) Check the adress syntax, both of the part before and after
|
||||
the @ sign.
|
||||
2) Look up the MX DNS records. If we found no MX record,
|
||||
print an error.
|
||||
3) Check if one of the mail hosts accept an SMTP connection.
|
||||
Check hosts with higher priority first.
|
||||
If no host accepts SMTP, we print a warning.
|
||||
4) Try to verify the address with the VRFY command. If we got
|
||||
an answer, print the verified address as an info.</p></li>
|
||||
<li><p>FTP links (<code>ftp:</code>)</p>
|
||||
|
||||
<p>For FTP links we do:</p>
|
||||
<ol class="arabic simple">
|
||||
<li>connect to the specified host</li>
|
||||
<li>try to login with the given user and password. The default
|
||||
user is <tt class="docutils literal"><span class="pre">anonymous</span></tt>, the default password is <tt class="docutils literal"><span class="pre">anonymous@</span></tt>.</li>
|
||||
<li>try to change to the given directory</li>
|
||||
<li>list the file with the NLST command</li>
|
||||
</ol>
|
||||
</li>
|
||||
<li><p class="first">Telnet links (<tt class="docutils literal"><span class="pre">telnet:</span></tt>)</p>
|
||||
|
||||
<p>1) connect to the specified host
|
||||
2) try to login with the given user and password. The default
|
||||
user is <code>anonymous</code>, the default password is <code>anonymous@</code>.
|
||||
3) try to change to the given directory
|
||||
4) list the file with the NLST command</p></li>
|
||||
<li><p>Telnet links (<code>telnet:</code>)</p>
|
||||
|
||||
<p>We try to connect and if user/password are given, login to the
|
||||
given telnet server.</p>
|
||||
</li>
|
||||
<li><p class="first">NNTP links (<tt class="docutils literal"><span class="pre">news:</span></tt>, <tt class="docutils literal"><span class="pre">snews:</span></tt>, <tt class="docutils literal"><span class="pre">nntp</span></tt>)</p>
|
||||
given telnet server.</p></li>
|
||||
<li><p>NNTP links (<code>news:</code>, <code>snews:</code>, <code>nntp</code>)</p>
|
||||
|
||||
<p>We try to connect to the given NNTP server. If a news group or
|
||||
article is specified, try to request it from the server.</p>
|
||||
</li>
|
||||
<li><p class="first">Ignored links (<tt class="docutils literal"><span class="pre">javascript:</span></tt>, etc.)</p>
|
||||
article is specified, try to request it from the server.</p></li>
|
||||
<li><p>Ignored links (<code>javascript:</code>, etc.)</p>
|
||||
|
||||
<p>An ignored link will only print a warning. No further checking
|
||||
will be made.</p>
|
||||
|
||||
<p>Here is a complete list of recognized, but ignored links. The most
|
||||
prominent of them should be JavaScript links.</p>
|
||||
<ul class="simple">
|
||||
<li><tt class="docutils literal"><span class="pre">acap:</span></tt> (application configuration access protocol)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">afs:</span></tt> (Andrew File System global file names)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">chrome:</span></tt> (Mozilla specific)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">cid:</span></tt> (content identifier)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">clsid:</span></tt> (Microsoft specific)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">data:</span></tt> (data)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">dav:</span></tt> (dav)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">fax:</span></tt> (fax)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">find:</span></tt> (Mozilla specific)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">gopher:</span></tt> (Gopher)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">imap:</span></tt> (internet message access protocol)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">isbn:</span></tt> (ISBN (int. book numbers))</li>
|
||||
<li><tt class="docutils literal"><span class="pre">javascript:</span></tt> (JavaScript)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">ldap:</span></tt> (Lightweight Directory Access Protocol)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">mailserver:</span></tt> (Access to data available from mail servers)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">mid:</span></tt> (message identifier)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">mms:</span></tt> (multimedia stream)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">modem:</span></tt> (modem)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">nfs:</span></tt> (network file system protocol)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">opaquelocktoken:</span></tt> (opaquelocktoken)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">pop:</span></tt> (Post Office Protocol v3)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">prospero:</span></tt> (Prospero Directory Service)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">rsync:</span></tt> (rsync protocol)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">rtsp:</span></tt> (real time streaming protocol)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">service:</span></tt> (service location)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">shttp:</span></tt> (secure HTTP)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">sip:</span></tt> (session initiation protocol)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">tel:</span></tt> (telephone)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">tip:</span></tt> (Transaction Internet Protocol)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">tn3270:</span></tt> (Interactive 3270 emulation sessions)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">vemmi:</span></tt> (versatile multimedia interface)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">wais:</span></tt> (Wide Area Information Servers)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">z39.50r:</span></tt> (Z39.50 Retrieval)</li>
|
||||
<li><tt class="docutils literal"><span class="pre">z39.50s:</span></tt> (Z39.50 Session)</li>
|
||||
|
||||
<ul>
|
||||
<li><code>acap:</code> (application configuration access protocol)</li>
|
||||
<li><code>afs:</code> (Andrew File System global file names)</li>
|
||||
<li><code>chrome:</code> (Mozilla specific)</li>
|
||||
<li><code>cid:</code> (content identifier)</li>
|
||||
<li><code>clsid:</code> (Microsoft specific)</li>
|
||||
<li><code>data:</code> (data)</li>
|
||||
<li><code>dav:</code> (dav)</li>
|
||||
<li><code>fax:</code> (fax)</li>
|
||||
<li><code>find:</code> (Mozilla specific)</li>
|
||||
<li><code>gopher:</code> (Gopher)</li>
|
||||
<li><code>imap:</code> (internet message access protocol)</li>
|
||||
<li><code>isbn:</code> (ISBN (int. book numbers))</li>
|
||||
<li><code>javascript:</code> (JavaScript)</li>
|
||||
<li><code>ldap:</code> (Lightweight Directory Access Protocol)</li>
|
||||
<li><code>mailserver:</code> (Access to data available from mail servers)</li>
|
||||
<li><code>mid:</code> (message identifier)</li>
|
||||
<li><code>mms:</code> (multimedia stream)</li>
|
||||
<li><code>modem:</code> (modem)</li>
|
||||
<li><code>nfs:</code> (network file system protocol)</li>
|
||||
<li><code>opaquelocktoken:</code> (opaquelocktoken)</li>
|
||||
<li><code>pop:</code> (Post Office Protocol v3)</li>
|
||||
<li><code>prospero:</code> (Prospero Directory Service)</li>
|
||||
<li><code>rsync:</code> (rsync protocol)</li>
|
||||
<li><code>rtsp:</code> (real time streaming protocol)</li>
|
||||
<li><code>service:</code> (service location)</li>
|
||||
<li><code>shttp:</code> (secure HTTP)</li>
|
||||
<li><code>sip:</code> (session initiation protocol)</li>
|
||||
<li><code>tel:</code> (telephone)</li>
|
||||
<li><code>tip:</code> (Transaction Internet Protocol)</li>
|
||||
<li><code>tn3270:</code> (Interactive 3270 emulation sessions)</li>
|
||||
<li><code>vemmi:</code> (versatile multimedia interface)</li>
|
||||
<li><code>wais:</code> (Wide Area Information Servers)</li>
|
||||
<li><code>z39.50r:</code> (Z39.50 Retrieval)</li>
|
||||
<li><code>z39.50s:</code> (Z39.50 Session)</li>
|
||||
</ul></li>
|
||||
</ul>
|
||||
</li>
|
||||
</ul>
|
||||
</div>
|
||||
<div class="section" id="recursion">
|
||||
|
||||
<h2>Recursion</h2>
|
||||
|
||||
<p>Before descending recursively into a URL, it has to fulfill several
|
||||
conditions. They are checked in this order:</p>
|
||||
<ol class="arabic simple">
|
||||
<li>A URL must be valid.</li>
|
||||
<li>A URL must be parseable. This currently includes HTML files,
|
||||
|
||||
<ol>
|
||||
<li><p>A URL must be valid.</p></li>
|
||||
<li><p>A URL must be parseable. This currently includes HTML files,
|
||||
Opera bookmarks files, and directories. If a file type cannot
|
||||
be determined (for example it does not have a common HTML file
|
||||
extension, and the content does not look like HTML), it is assumed
|
||||
to be non-parseable.</li>
|
||||
<li>The URL content must be retrievable. This is usually the case
|
||||
except for example mailto: or unknown URL types.</li>
|
||||
<li>The maximum recursion level must not be exceeded. It is configured
|
||||
with the <tt class="docutils literal"><span class="pre">--recursion-level</span></tt> option and is unlimited per default.</li>
|
||||
<li>It must not match the ignored URL list. This is controlled with
|
||||
the <tt class="docutils literal"><span class="pre">--ignore-url</span></tt> option.</li>
|
||||
<li>The Robots Exclusion Protocol must allow links in the URL to be
|
||||
to be non-parseable.</p></li>
|
||||
<li><p>The URL content must be retrievable. This is usually the case
|
||||
except for example mailto: or unknown URL types.</p></li>
|
||||
<li><p>The maximum recursion level must not be exceeded. It is configured
|
||||
with the <code>--recursion-level</code> option and is unlimited per default.</p></li>
|
||||
<li><p>It must not match the ignored URL list. This is controlled with
|
||||
the <code>--ignore-url</code> option.</p></li>
|
||||
<li><p>The Robots Exclusion Protocol must allow links in the URL to be
|
||||
followed recursively. This is checked by searching for a
|
||||
“nofollow” directive in the HTML header data.</li>
|
||||
"nofollow" directive in the HTML header data.</p></li>
|
||||
</ol>
|
||||
|
||||
<p>Note that the directory recursion reads all files in that
|
||||
directory, not just a subset like <tt class="docutils literal"><span class="pre">index.htm*</span></tt>.</p>
|
||||
</div>
|
||||
<div class="section" id="frequently-asked-questions">
|
||||
<h2>Frequently asked questions</h2>
|
||||
<p><strong>Q: LinkChecker produced an error, but my web page is ok with
|
||||
Mozilla/IE/Opera/...
|
||||
Is this a bug in LinkChecker?</strong></p>
|
||||
<p>A: Please check your web pages first. Are they really ok?
|
||||
Use the <tt class="docutils literal"><span class="pre">--check-html</span></tt> option, or check if you are using a proxy
|
||||
which produces the error.</p>
|
||||
<p><strong>Q: I still get an error, but the page is definitely ok.</strong></p>
|
||||
<p>A: Some servers deny access of automated tools (also called robots)
|
||||
like LinkChecker. This is not a bug in LinkChecker but rather a
|
||||
policy by the webmaster running the website you are checking. Look
|
||||
the <tt class="docutils literal"><span class="pre">/robots.txt</span></tt> file which follows the <a class="reference external" href="http://www.robotstxt.org/wc/norobots-rfc.html">robots.txt exclusion standard</a>.</p>
|
||||
<p><strong>Q: How can I tell LinkChecker which proxy to use?</strong></p>
|
||||
<p>A: LinkChecker works transparently with proxies. In a Unix or Windows
|
||||
environment, set the http_proxy, https_proxy, ftp_proxy environment
|
||||
variables to a URL that identifies the proxy server before starting
|
||||
LinkChecker. For example</p>
|
||||
<div class="highlight-python"><pre>$ http_proxy="http://www.someproxy.com:3128"
|
||||
$ export http_proxy</pre>
|
||||
</div>
|
||||
<p><strong>Q: The link “mailto:john@company.com?subject=Hello John” is reported
|
||||
as an error.</strong></p>
|
||||
<p>A: You have to quote special characters (e.g. spaces) in the subject field.
|
||||
The correct link should be “mailto:...?subject=Hello%20John”
|
||||
Unfortunately browsers like IE and Netscape do not enforce this.</p>
|
||||
<p><strong>Q: Has LinkChecker JavaScript support?</strong></p>
|
||||
<p>A: No, it never will. If your page is not working without JS, it is
|
||||
better checked with a browser testing tool like <a class="reference external" href="http://seleniumhq.org/">Selenium</a>.</p>
|
||||
<p><strong>Q: Is LinkCheckers cookie feature insecure?</strong></p>
|
||||
<p>A: If a cookie file is specified, the information will be sent
|
||||
to the specified hosts.
|
||||
The following restrictions apply for LinkChecker cookies:</p>
|
||||
<ul class="simple">
|
||||
<li>Cookies will only be sent to the originating server.</li>
|
||||
<li>Cookies are only stored in memory. After LinkChecker finishes, they
|
||||
are lost.</li>
|
||||
<li>The cookie feature is disabled as default.</li>
|
||||
</ul>
|
||||
<p><strong>Q: I see LinkChecker gets a /robots.txt file for every site it
|
||||
checks. What is that about?</strong></p>
|
||||
<p>A: LinkChecker follows the <a class="reference external" href="http://www.robotstxt.org/wc/norobots-rfc.html">robots.txt exclusion standard</a>. To avoid
|
||||
misuse of LinkChecker, you cannot turn this feature off.
|
||||
See the <a class="reference external" href="http://www.robotstxt.org/wc/robots.html">Web Robot pages</a> and the <a class="reference external" href="http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt">Spidering report</a> for more info.</p>
|
||||
<p><strong>Q: How do I print unreachable/dead documents of my website with
|
||||
LinkChecker?</strong></p>
|
||||
<p>A: No can do. This would require file system access to your web
|
||||
repository and access to your web server configuration.</p>
|
||||
<p><strong>Q: How do I check HTML/XML/CSS syntax with LinkChecker?</strong></p>
|
||||
<p>A: Use the <tt class="docutils literal"><span class="pre">--check-html</span></tt> and <tt class="docutils literal"><span class="pre">--check-css</span></tt> options.</p>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
|
||||
</div>
|
||||
</div>
|
||||
<div class="clearer"></div>
|
||||
</div>
|
||||
directory, not just a subset like <code>index.htm*</code>.</p>
|
||||
<div class="footer">
|
||||
© Copyright 2009, Bastian Kleineidam.
|
||||
</div>
|
||||
|
|
|
|||
146
doc/html/index.txt
Normal file
146
doc/html/index.txt
Normal file
|
|
@ -0,0 +1,146 @@
|
|||
Documentation
|
||||
=============
|
||||
|
||||
Basic usage
|
||||
-----------
|
||||
|
||||
To check a URL like ``http://www.myhomepage.org/`` it is enough to
|
||||
execute ``linkchecker http://www.myhomepage.org/``. This will check the
|
||||
complete domain of www.myhomepage.org recursively. All links pointing
|
||||
outside of the domain are also checked for validity.
|
||||
|
||||
Performed checks
|
||||
----------------
|
||||
|
||||
All URLs have to pass a preliminary syntax test. Minor quoting
|
||||
mistakes will issue a warning, all other invalid syntax issues
|
||||
are errors.
|
||||
After the syntax check passes, the URL is queued for connection
|
||||
checking. All connection check types are described below.
|
||||
|
||||
- HTTP links (``http:``, ``https:``)
|
||||
|
||||
After connecting to the given HTTP server the given path
|
||||
or query is requested. All redirections are followed, and
|
||||
if user/password is given it will be used as authorization
|
||||
when necessary.
|
||||
Permanently moved pages issue a warning.
|
||||
All final HTTP status codes other than 2xx are errors.
|
||||
|
||||
- Local files (``file:``)
|
||||
|
||||
A regular, readable file that can be opened is valid. A readable
|
||||
directory is also valid. All other files, for example device files,
|
||||
unreadable or non-existing files are errors.
|
||||
|
||||
File contents are checked for recursion.
|
||||
|
||||
- Mail links (``mailto:``)
|
||||
|
||||
A mailto: link eventually resolves to a list of email addresses.
|
||||
If one address fails, the whole list will fail.
|
||||
For each mail address we check the following things:
|
||||
|
||||
1) Check the adress syntax, both of the part before and after
|
||||
the @ sign.
|
||||
2) Look up the MX DNS records. If we found no MX record,
|
||||
print an error.
|
||||
3) Check if one of the mail hosts accept an SMTP connection.
|
||||
Check hosts with higher priority first.
|
||||
If no host accepts SMTP, we print a warning.
|
||||
4) Try to verify the address with the VRFY command. If we got
|
||||
an answer, print the verified address as an info.
|
||||
|
||||
- FTP links (``ftp:``)
|
||||
|
||||
For FTP links we do:
|
||||
|
||||
1) connect to the specified host
|
||||
2) try to login with the given user and password. The default
|
||||
user is ``anonymous``, the default password is ``anonymous@``.
|
||||
3) try to change to the given directory
|
||||
4) list the file with the NLST command
|
||||
|
||||
- Telnet links (``telnet:``)
|
||||
|
||||
We try to connect and if user/password are given, login to the
|
||||
given telnet server.
|
||||
|
||||
- NNTP links (``news:``, ``snews:``, ``nntp``)
|
||||
|
||||
We try to connect to the given NNTP server. If a news group or
|
||||
article is specified, try to request it from the server.
|
||||
|
||||
- Ignored links (``javascript:``, etc.)
|
||||
|
||||
An ignored link will only print a warning. No further checking
|
||||
will be made.
|
||||
|
||||
Here is a complete list of recognized, but ignored links. The most
|
||||
prominent of them should be JavaScript links.
|
||||
|
||||
- ``acap:`` (application configuration access protocol)
|
||||
- ``afs:`` (Andrew File System global file names)
|
||||
- ``chrome:`` (Mozilla specific)
|
||||
- ``cid:`` (content identifier)
|
||||
- ``clsid:`` (Microsoft specific)
|
||||
- ``data:`` (data)
|
||||
- ``dav:`` (dav)
|
||||
- ``fax:`` (fax)
|
||||
- ``find:`` (Mozilla specific)
|
||||
- ``gopher:`` (Gopher)
|
||||
- ``imap:`` (internet message access protocol)
|
||||
- ``isbn:`` (ISBN (int. book numbers))
|
||||
- ``javascript:`` (JavaScript)
|
||||
- ``ldap:`` (Lightweight Directory Access Protocol)
|
||||
- ``mailserver:`` (Access to data available from mail servers)
|
||||
- ``mid:`` (message identifier)
|
||||
- ``mms:`` (multimedia stream)
|
||||
- ``modem:`` (modem)
|
||||
- ``nfs:`` (network file system protocol)
|
||||
- ``opaquelocktoken:`` (opaquelocktoken)
|
||||
- ``pop:`` (Post Office Protocol v3)
|
||||
- ``prospero:`` (Prospero Directory Service)
|
||||
- ``rsync:`` (rsync protocol)
|
||||
- ``rtsp:`` (real time streaming protocol)
|
||||
- ``service:`` (service location)
|
||||
- ``shttp:`` (secure HTTP)
|
||||
- ``sip:`` (session initiation protocol)
|
||||
- ``tel:`` (telephone)
|
||||
- ``tip:`` (Transaction Internet Protocol)
|
||||
- ``tn3270:`` (Interactive 3270 emulation sessions)
|
||||
- ``vemmi:`` (versatile multimedia interface)
|
||||
- ``wais:`` (Wide Area Information Servers)
|
||||
- ``z39.50r:`` (Z39.50 Retrieval)
|
||||
- ``z39.50s:`` (Z39.50 Session)
|
||||
|
||||
|
||||
Recursion
|
||||
---------
|
||||
|
||||
Before descending recursively into a URL, it has to fulfill several
|
||||
conditions. They are checked in this order:
|
||||
|
||||
1. A URL must be valid.
|
||||
|
||||
2. A URL must be parseable. This currently includes HTML files,
|
||||
Opera bookmarks files, and directories. If a file type cannot
|
||||
be determined (for example it does not have a common HTML file
|
||||
extension, and the content does not look like HTML), it is assumed
|
||||
to be non-parseable.
|
||||
|
||||
3. The URL content must be retrievable. This is usually the case
|
||||
except for example mailto: or unknown URL types.
|
||||
|
||||
4. The maximum recursion level must not be exceeded. It is configured
|
||||
with the ``--recursion-level`` option and is unlimited per default.
|
||||
|
||||
5. It must not match the ignored URL list. This is controlled with
|
||||
the ``--ignore-url`` option.
|
||||
|
||||
6. The Robots Exclusion Protocol must allow links in the URL to be
|
||||
followed recursively. This is checked by searching for a
|
||||
"nofollow" directive in the HTML header data.
|
||||
|
||||
Note that the directory recursion reads all files in that
|
||||
directory, not just a subset like ``index.htm*``.
|
||||
0
doc/html/logo64x64.png
Executable file → Normal file
0
doc/html/logo64x64.png
Executable file → Normal file
|
Before Width: | Height: | Size: 8 KiB After Width: | Height: | Size: 8 KiB |
Loading…
Reference in a new issue