# Documentation

## Basic usage

To check a URL like ``http://www.example.org/`` it is enough to
type ``linkchecker www.example.org`` on the command line or
type ``www.example.org`` in the GUI application. This will check the
complete domain of ``http://www.example.org`` recursively. All links
pointing outside of the domain are also checked for validity.
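
The distinction between links inside and outside the domain can be illustrated with a small Python sketch. The helper names `normalize_start_url` and `is_internal`, and the choice to prepend ``http://`` when no scheme is typed, are assumptions made for this example and are not taken from LinkChecker's source code.

```python
from urllib.parse import urlsplit


def normalize_start_url(url):
    """Prepend http:// when the user typed no scheme (assumed default)."""
    return url if "://" in url else "http://" + url


def is_internal(start_url, link):
    """Return True if link points to the same host as the start URL."""
    start_host = urlsplit(normalize_start_url(start_url)).hostname
    link_host = urlsplit(link).hostname
    return link_host is not None and link_host == start_host


start = "www.example.org"
print(normalize_start_url(start))
print(is_internal(start, "http://www.example.org/about.html"))
print(is_internal(start, "http://other.example.com/"))
```

In this sketch a link counts as internal only when its host name matches the start host exactly; a real checker may use a different rule.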

## Performed checks

All URLs have to pass a preliminary syntax test. Minor quoting
mistakes will issue a warning, all other invalid syntax issues
are errors. After the syntax check passes, the URL is queued for
connection checking. All connection check types are described below.

- HTTP links (``http:``, ``https:``)
  
  After connecting to the given HTTP server the given path
  or query is requested. All redirections are followed, and
  if user/password is given it will be used as authorization
  when necessary.
  Permanently moved pages issue a warning.
  All final HTTP status codes other than 2xx are errors.

- Local files (``file:``)
  
  A regular, readable file that can be opened is valid. A readable
  directory is also valid. All other files, for example unreadable,
  non-existing or device files, are errors.
  
  File contents are checked for recursion. If they are parseable
  files (for example HTML files), all links in that file will be
  checked.
  
- Mail links (``mailto:``)
  
  A mailto: link resolves to a list of email addresses.
  If one address fails the whole list will fail.
  For each mail address the following things are checked:
  
  1. Check the address syntax, both the part before and after
     the @ sign.
  2. Look up the MX DNS records. If no MX record is found,
     print an error.
  3. Check if one of the MX mail hosts accepts an SMTP connection.
     Check hosts with higher priority first.
     If none of the hosts accept SMTP, a warning is printed.
  4. Try to verify the address with the VRFY command. If there is
     an answer, the verified address is printed as an info.

- FTP links (``ftp:``)
  
  For FTP links the following is checked:
  
  1. Connect to the specified host.
  2. Try to login with the given user and password. The default
     user is ``anonymous``, the default password is ``anonymous@``.
  3. Try to change to the given directory.
  4. List the file with the NLST command.

- Telnet links (``telnet:``)
  
  A connection to the given telnet server is attempted and, if
  user/password are given, a login is tried.

- NNTP links (``news:``, ``snews:``, ``nntp:``)
  
  A connection to the given NNTP server is attempted. If a news group or
  article is specified, it will be requested from the server.

- Ignored links (``javascript:``, etc.)
  
  An ignored link will print a warning, but no error. No further checking
  will be made.
  
  Here is the complete list of recognized, but ignored links. The most
  prominent of them are JavaScript links.
  
  - ``acap:``      (application configuration access protocol)
  - ``afs:``       (Andrew File System global file names)
  - ``chrome:``    (Mozilla specific)
  - ``cid:``       (content identifier)
  - ``clsid:``     (Microsoft specific)
  - ``data:``      (data)
  - ``dav:``       (dav)
  - ``fax:``       (fax)
  - ``find:``      (Mozilla specific)
  - ``gopher:``    (Gopher)
  - ``imap:``      (internet message access protocol)
  - ``irc:``       (internet relay chat)
  - ``isbn:``      (ISBN (int. book numbers))
  - ``javascript:`` (JavaScript)
  - ``ldap:``      (Lightweight Directory Access Protocol)
  - ``mailserver:`` (Access to data available from mail servers)
  - ``mid:``       (message identifier)
  - ``mms:``       (multimedia stream)
  - ``modem:``     (modem)
  - ``nfs:``       (network file system protocol)
  - ``opaquelocktoken:`` (opaquelocktoken)
  - ``pop:``       (Post Office Protocol v3)
  - ``prospero:``  (Prospero Directory Service)
  - ``rsync:``     (rsync protocol)
  - ``rtsp:``      (real time streaming protocol)
  - ``service:``   (service location)
  - ``shttp:``     (secure HTTP)
  - ``sip:``       (session initiation protocol)
  - ``tel:``       (telephone)
  - ``tip:``       (Transaction Internet Protocol)
  - ``tn3270:``    (Interactive 3270 emulation sessions)
  - ``vemmi:``     (versatile multimedia interface)
  - ``wais:``      (Wide Area Information Servers)
  - ``z39.50r:``   (Z39.50 Retrieval)
  - ``z39.50s:``   (Z39.50 Session)
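
The HTTP rules in the list above (a final status outside 2xx is an error, a permanently moved page gives a warning) can be sketched as a small Python function. The function name `classify_http_result`, its parameters and the result strings are hypothetical, and treating only status 301 as "permanently moved" is an assumption of this sketch.

```python
def classify_http_result(final_status, redirect_statuses=()):
    """Return (result, warnings) for a request whose redirects were followed.

    final_status: status code of the last response.
    redirect_statuses: status codes of the redirects followed on the way.
    """
    warnings = []
    # Assumption: 301 (Moved Permanently) marks a permanently moved page.
    if 301 in redirect_statuses:
        warnings.append("permanently moved")
    # Only a final 2xx status counts as valid.
    result = "valid" if 200 <= final_status < 300 else "error"
    return result, warnings


print(classify_http_result(200))
print(classify_http_result(200, [301]))
print(classify_http_result(404, [302]))
```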

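For mail links, the first step (the syntax check) and the rule that one failing address fails the whole list can be sketched in Python as follows. This is a deliberately simplified illustration: the function names are hypothetical, the syntax test is much cruder than a real address parser, and the MX lookup, SMTP connection and VRFY steps are left out because they need network access.

```python
from urllib.parse import unquote, urlsplit


def addresses_from_mailto(url):
    """Split the path of a mailto: URL into its comma-separated addresses."""
    path = urlsplit(url).path
    return [unquote(part).strip() for part in path.split(",") if part.strip()]


def has_valid_syntax(address):
    """Rough check of the parts before and after the @ sign."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return False
    # Every dot-separated label of the domain must be non-empty.
    return all(label and not label.startswith("-") for label in domain.split("."))


def check_mailto(url):
    """The list is valid only if every single address is valid."""
    addresses = addresses_from_mailto(url)
    return bool(addresses) and all(has_valid_syntax(a) for a in addresses)


print(check_mailto("mailto:alice@example.org,bob@example.org"))
print(check_mailto("mailto:alice@example.org,broken"))
```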

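How a URL scheme selects between a connection check and the "ignored" warning can be sketched like this. The names `CHECKED_SCHEMES`, `IGNORED_SCHEMES` and `classify` are made up for the example, the ignored set contains only a few entries from the list above, and the ``"unknown"`` result for any other scheme is an assumption, since this section does not describe how such URLs are reported.

```python
from urllib.parse import urlsplit

# Schemes that get a connection check, taken from the list above.
CHECKED_SCHEMES = {
    "http", "https", "file", "mailto", "ftp", "telnet", "news", "snews", "nntp",
}

# A small sample of the recognized but ignored schemes (not the full list).
IGNORED_SCHEMES = {"javascript", "data", "irc", "ldap", "rsync", "sip"}


def classify(url):
    """Return 'check', 'warning' (ignored link) or 'unknown'."""
    scheme = urlsplit(url).scheme.lower()
    if scheme in CHECKED_SCHEMES:
        return "check"
    if scheme in IGNORED_SCHEMES:
        return "warning"
    return "unknown"


print(classify("http://www.example.org/"))
print(classify("javascript:void(0)"))
```
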
## Recursion

Before descending recursively into a URL, it has to fulfill several
conditions. The conditions are checked in this order:

1. The URL must be valid.
2. The URL must be parseable. This currently includes HTML files,
   Opera bookmarks files, directories and, on Windows systems, MS Word
   files if Word is installed on your system. If a file type cannot
   be determined (for example it does not have a common HTML file
   extension, and the content does not look like HTML), it is assumed
   to be non-parseable.
3. The URL content must be retrievable. This is usually the case,
   except for example for mailto: or unknown URL types.
4. The maximum recursion level must not be exceeded. It is configured
   with the ``--recursion-level`` command line option, the recursion
   level GUI option, or through the configuration file.
   The recursion level is unlimited by default.
5. It must not match the ignored URL list. This is controlled with
   the ``--ignore-url`` command line option or through the
   configuration file.
6. The Robots Exclusion Protocol must allow links in the URL to be
   followed recursively. This is checked by evaluating the server's
   robots.txt file and searching for a "nofollow" directive in the
   HTML header data.

Note that the local and FTP directory recursion reads all files in that
directory, not just a subset like ``index.htm*``.
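
The ordered conditions above can be read as a chain of early exits. The following Python sketch shows that reading; it is not LinkChecker's code. The `UrlInfo` class and `may_recurse` function are hypothetical, `max_level=None` stands in for the unlimited default, the level is assumed to count from 0 for the start URL, and the ignore list is assumed to hold regular expressions.

```python
import re
from dataclasses import dataclass


@dataclass
class UrlInfo:
    """Hypothetical summary of what is known about a checked URL."""
    url: str
    valid: bool = True
    parseable: bool = True
    retrievable: bool = True
    level: int = 0  # assumed: 0 for the start URL
    robots_allow_follow: bool = True


def may_recurse(info, max_level=None, ignore_patterns=()):
    """Return (True, None), or (False, reason) for the first failed condition."""
    if not info.valid:
        return False, "invalid URL"
    if not info.parseable:
        return False, "not parseable"
    if not info.retrievable:
        return False, "content not retrievable"
    # max_level=None models the unlimited default.
    if max_level is not None and info.level >= max_level:
        return False, "recursion level exceeded"
    if any(re.search(pattern, info.url) for pattern in ignore_patterns):
        return False, "matches ignore list"
    if not info.robots_allow_follow:
        return False, "disallowed by robots rules"
    return True, None


page = UrlInfo("http://www.example.org/docs/")
print(may_recurse(page))
print(may_recurse(page, max_level=0))
print(may_recurse(page, ignore_patterns=[r"/docs/"]))
```

Because the conditions are tested in order, a URL that fails several of them is reported with the first failing reason only.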


## Configuration file

Each user can edit a configuration with advanced options for
checking or filtering.

On Unix or OS X systems the user configuration file is at

- ``~/.linkchecker/linkcheckerrc``

On Windows the user configuration file is at

- ``%HOMEPATH%\.linkchecker\linkcheckerrc``
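
As a convenience, the two locations above can be computed with a few lines of Python. This is only a sketch: the function name `user_config_path` and its parameters are invented for the example, and it simply mirrors the paths listed above without checking that the file exists.

```python
import ntpath
import os
import sys


def user_config_path(platform=sys.platform, environ=os.environ):
    """Return the per-user configuration file path for the given platform."""
    if platform.startswith("win"):
        # Mirrors %HOMEPATH%\.linkchecker\linkcheckerrc
        home = environ.get("HOMEPATH", "")
        return ntpath.join(home, ".linkchecker", "linkcheckerrc")
    # Unix and OS X: ~/.linkchecker/linkcheckerrc
    return os.path.expanduser("~/.linkchecker/linkcheckerrc")


print(user_config_path())
```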