
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Check websites for broken links</title>
<link rel="stylesheet" href="sphinxdoc.css" type="text/css" />
<link rel="stylesheet" href="pygments.css" type="text/css" />
<style type="text/css">
img { border: 0; }
</style>
</head>
<body>
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<table border="0"><tr>
<td><img
src="logo64x64.png" border="0" alt="LinkChecker"/></td>
<td><h1>LinkChecker</h1></td>
</tr></table>
</div>
<h1>Documentation</h1>
<h2>Basic usage</h2>
<p>To check a URL like <a href="http://www.example.org/">http://www.example.org/</a>
it is enough to
type <code>linkchecker www.example.org</code> on the command line or
type <code>www.example.org</code> in the GUI application. This will check the
complete domain of <code>http://www.example.org</code> recursively. All links
pointing outside of the domain are also checked for validity.</p>
<p>Local files can also be checked. On Unix or OS X systems the syntax
is <code>file:///path/to/my/file.html</code>. On Windows the syntax is
<code>file://C|/path/to/my/file.html</code>. When directories are checked,
all files they contain are checked as well.</p>
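<p>As an illustration, a well-formed <code>file:</code> URL can be built from an
absolute Unix path with Python's standard library (a sketch for clarity, not
part of LinkChecker itself):</p>

```python
from pathlib import PurePosixPath

# Build a file: URL from an absolute Unix path; the path is illustrative.
url = PurePosixPath("/path/to/my/file.html").as_uri()
print(url)  # file:///path/to/my/file.html
```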
<p>In the GUI client the <code>Edit</code> menu has shortcuts for bookmark
files. For example if Google Chrome is installed, there will be
a menu entry called <code>Insert Google Chrome bookmark file</code> which
can be used to check all browser bookmarks.</p>
<h2>Options</h2>
<p>The command line client options are documented
in the <a href="http://wummel.github.io/linkchecker/man1/linkchecker.1.html">linkchecker(1) manual page</a>.</p>
<p>In the GUI client, the following options are available:</p>
<ul>
<li><p>Recursive depth</p>
<p>Check all links recursively up to the given depth.
A negative depth (e.g. <code>-1</code>) enables infinite recursion.</p></li>
<li><p>Verbose output</p>
<p>If set, log all checked URLs. Default is to log only errors and warnings.</p></li>
<li><p>Debug</p>
<p>Prints debugging output in a separate window which can be seen with
<code>Help -&gt; Show debug</code>.</p></li>
<li><p>Debug memory usage</p>
<p>Profiles memory usage and writes statistics and a dump file when checking
stops. The dump file can be examined with
<a href="https://github.com/wummel/linkchecker/blob/master/tests/analyze_memdump.py">external tools</a>.
This option is only useful for developers.</p></li>
<li><p>Warning strings</p>
<p>Log a warning if any of the given strings are found in the content of the
checked URL. Strings are entered one per line.</p>
<p>Use this to check for pages that contain some form of error, for example
"<code>This page has moved</code>" or "<code>Oracle Application error</code>".</p></li>
<li><p>Ignoring URLs</p>
<p>URLs matching the given regular expressions will be ignored and not checked.
This is useful if certain URL types should not be checked, such as email
addresses (e.g. "<code>^mailto:</code>").</p></li>
</ul>
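<p>The ignore option described above amounts to matching each URL against a
list of regular expressions. A minimal sketch in Python (the patterns shown
are examples, not defaults):</p>

```python
import re

# Example ignore patterns, one per line as they would be entered in the GUI.
patterns = [r"^mailto:", r"^javascript:"]
compiled = [re.compile(p) for p in patterns]

def is_ignored(url):
    """Return True if the URL matches any ignore pattern."""
    return any(p.search(url) for p in compiled)

print(is_ignored("mailto:user@example.org"))   # True
print(is_ignored("http://www.example.org/"))   # False
```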
<h2>Configuration file</h2>
<p>Each user can edit a configuration file with advanced options for
checking or filtering. The
<a href="http://wummel.github.io/linkchecker/man5/linkcheckerrc.5.html">linkcheckerrc(5) manual page</a>
documents all the options.</p>
<p>In the GUI client the configuration file can be edited directly from
the dialog <code>Edit -&gt; Options</code> and then clicking on <code>Edit</code>.</p>
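<p>The configuration file uses an INI-style format that can be read with
Python's <code>configparser</code>. The snippet below is a hypothetical
example; see the linkcheckerrc(5) manual page for the actual section and
option names:</p>

```python
import configparser

# A hypothetical linkcheckerrc-style snippet; section and option names
# are modeled on the linkcheckerrc(5) manual page, not copied from it.
text = """
[checking]
recursionlevel=5

[filtering]
ignore=
  ^mailto:
"""
cfg = configparser.ConfigParser()
cfg.read_string(text)
print(cfg.getint("checking", "recursionlevel"))  # 5
```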
<h2>Performed checks</h2>
<p>All URLs have to pass a preliminary syntax test.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>
<ul>
<li><p>HTTP links (<code>http:</code>, <code>https:</code>)</p>
<p>After connecting to the given HTTP server, the given path
or query is requested. All redirections are followed, and
if a user name and password are given, they are used for
authorization when necessary.
Permanently moved pages (status code 301) issue a warning.
All final HTTP status codes other than 2xx are errors.</p>
<p>For HTTPS links, the SSL certificate is checked against the
given hostname. If it does not match, a warning is printed.</p></li>
<li><p>Local files (<code>file:</code>)</p>
<p>A regular file that can be opened is valid. A readable
directory is also valid. All other files, for example unreadable
or non-existent files and device files, are errors.</p>
<p>File contents are checked for recursion: if a file is parseable
(for example an HTML file), all links it contains are
checked.</p></li>
<li><p>Mail links (<code>mailto:</code>)</p>
<p>A mailto: link resolves to a list of email addresses.
If any email address fails, the whole list fails.
For each email address the following things are checked:</p>
<ol>
<li>Check the address syntax, both the part before and after
the @ sign.</li>
<li>Look up the MX DNS records. If no MX record is found,
print an error.</li>
<li>Check whether one of the MX mail hosts accepts an SMTP connection.
Hosts with higher priority are checked first.
If none of the hosts accepts SMTP, a warning is printed.</li>
<li>Try to verify the address with the VRFY command. If there is
an answer, the verified address is printed as an info message.</li>
</ol></li>
<li><p>FTP links (<code>ftp:</code>)</p>
<p>For FTP links the following is checked:</p>
<ol>
<li>Connect to the specified host.</li>
<li>Try to login with the given user and password. The default
user is <code>anonymous</code>, the default password is <code>anonymous@</code>.</li>
<li>Try to change to the given directory.</li>
<li>List the file with the NLST command.</li>
</ol></li>
<li><p>Telnet links (<code>telnet:</code>)</p>
<p>A connection is attempted and, if a user name and password are
given, a login to the given Telnet server is tried.</p></li>
<li><p>NNTP links (<code>news:</code>, <code>snews:</code>, <code>nntp:</code>)</p>
<p>A connection to the given NNTP server is attempted. If a news group or
article is specified, it is requested from the server.</p></li>
<li><p>Unsupported links (<code>javascript:</code>, etc.)</p>
<p>An unsupported link prints a warning, but no error. No further checking
is done.</p>
<p>The complete list of recognized but unsupported links can be seen in the
<a href="https://github.com/wummel/linkchecker/blob/master/linkcheck/checker/unknownurl.py">unknownurl.py</a>
source file. The most prominent of them are JavaScript links.</p></li>
</ul>
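<p>The HTTP result handling described above can be sketched as a small
classifier. This only illustrates the stated rules (final 2xx is valid, a 301
along the way adds a warning, everything else is an error) and is not
LinkChecker's actual implementation:</p>

```python
def classify_http_result(final_code, permanent_redirect=False):
    """Classify a final HTTP status code per the rules described above.

    A final 2xx code is valid (with a warning if a 301 permanent
    redirect was followed along the way); anything else is an error.
    """
    if 200 <= final_code < 300:
        return "valid, warning" if permanent_redirect else "valid"
    return "error"

print(classify_http_result(200))         # valid
print(classify_http_result(200, True))   # valid, warning
print(classify_http_result(404))         # error
```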
<h2>Recursion</h2>
<p>Before a URL is checked recursively, it has to fulfill several
conditions, which are checked in this order:</p>
<ol>
<li>The URL must be valid.</li>
<li>The URL must be parseable. Parseable content currently includes
HTML files, bookmark files (Opera, Chrome or Safari), directories
and, on Windows systems, MS Word files if Word and the Pywin32
module are installed.
If a file type cannot be determined (for example, it does not have
a common HTML file extension and the content does not look like
HTML), it is assumed to be non-parseable.</li>
<li>The URL content must be retrievable. This is usually the case,
except for example for <code>mailto:</code> or unknown URL types.</li>
<li>The maximum recursion level must not be exceeded. It is configured
with the <code>--recursion-level</code> command line option, the recursion
level GUI option, or through the configuration file.
The recursion level is unlimited by default.</li>
<li>It must not match the ignored URL list. This is controlled with
the <code>--ignore-url</code> command line option or through the
configuration file.</li>
<li>The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by evaluating the server's
robots.txt file and searching for a "nofollow" directive in the
HTML header data.</li>
</ol>
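<p>The robots.txt part of the last condition can be illustrated with Python's
standard <code>urllib.robotparser</code> module (a sketch; LinkChecker's own
implementation may differ):</p>

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly; in practice the file would be
# fetched from http://host/robots.txt before checking links on that host.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("LinkChecker", "http://www.example.org/private/a.html"))  # False
print(rp.can_fetch("LinkChecker", "http://www.example.org/index.html"))      # True
```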
<p>Note that local and FTP directory recursion reads all files in a
directory, not just a subset such as <code>index.htm*</code>.</p>
<hr/>
<div class="footer">
&copy; Copyright 2012-2014, Bastian Kleineidam.
</div>
</body>
</html>