# Documentation

## Basic usage

To check a URL like ``http://www.example.org/`` it is enough to
type ``linkchecker www.example.org`` on the command line or
type ``www.example.org`` in the GUI application. This will check the
complete domain of ``http://www.example.org`` recursively. All links
pointing outside of the domain are also checked for validity.

## Performed checks
All URLs have to pass a preliminary syntax test. Minor quoting
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.
- HTTP links (``http:``, ``https:``)
After connecting to the given HTTP server the given path
or query is requested. All redirections are followed, and
if user/password is given it will be used as authorization
when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.
- Local files (``file:``)
A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example unreadable,
non-existing or device files, are errors.

File contents are checked for recursion. If they are parseable
files (for example HTML files), all links in that file will be
checked.
- Mail links (``mailto:``)
A mailto: link resolves to a list of email addresses.
If one address fails, the whole list fails.
For each mail address the following things are checked:

1. Check the address syntax, both the part before and after
the @ sign.
2. Look up the MX DNS records. If no MX record is found,
print an error.
3. Check if one of the MX mail hosts accepts an SMTP connection.
Check hosts with higher priority first.
If none of the hosts accepts SMTP, a warning is printed.
4. Try to verify the address with the VRFY command. If there is
an answer, the verified address is printed as an info.
- FTP links (``ftp:``)
For FTP links the following is checked:

1. Connect to the specified host.
2. Try to log in with the given user and password. The default
user is ``anonymous``, the default password is ``anonymous@``.
3. Try to change to the given directory.
4. List the file with the NLST command.
- Telnet links (``telnet:``)
A connection to the given telnet server is attempted and, if
user/password are given, a login is tried.
- NNTP links (``news:``, ``snews:``, ``nntp:``)
A connection to the given NNTP server is attempted. If a news group or
article is specified, it will be requested from the server.
- Ignored links (``javascript:``, etc.)
An ignored link will print a warning, but no error. No further checking
will be made.

Here is the complete list of recognized, but ignored links. The most
prominent of them are JavaScript links.
- ``acap:`` (application configuration access protocol)
- ``afs:`` (Andrew File System global file names)
- ``chrome:`` (Mozilla specific)
- ``cid:`` (content identifier)
- ``clsid:`` (Microsoft specific)
- ``data:`` (data)
- ``dav:`` (dav)
- ``fax:`` (fax)
- ``find:`` (Mozilla specific)
- ``gopher:`` (Gopher)
- ``imap:`` (internet message access protocol)
- ``irc:`` (internet relay chat)
- ``isbn:`` (ISBN (int. book numbers))
- ``javascript:`` (JavaScript)
- ``ldap:`` (Lightweight Directory Access Protocol)
- ``mailserver:`` (Access to data available from mail servers)
- ``mid:`` (message identifier)
- ``mms:`` (multimedia stream)
- ``modem:`` (modem)
- ``nfs:`` (network file system protocol)
- ``opaquelocktoken:`` (opaquelocktoken)
- ``pop:`` (Post Office Protocol v3)
- ``prospero:`` (Prospero Directory Service)
- ``rsync:`` (rsync protocol)
- ``rtsp:`` (real time streaming protocol)
- ``service:`` (service location)
- ``shttp:`` (secure HTTP)
- ``sip:`` (session initiation protocol)
- ``tel:`` (telephone)
- ``tip:`` (Transaction Internet Protocol)
- ``tn3270:`` (Interactive 3270 emulation sessions)
- ``vemmi:`` (versatile multimedia interface)
- ``wais:`` (Wide Area Information Servers)
- ``z39.50r:`` (Z39.50 Retrieval)
- ``z39.50s:`` (Z39.50 Session)
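
The split between checked and ignored link types can be sketched as a lookup on the URL scheme. This is only an illustration, not LinkChecker's actual code: the function name ``classify`` and both tables are hypothetical, and the ignored set holds just a few entries from the list above.

```python
from urllib.parse import urlsplit

# Schemes that get a connection check, mapped to the check type
# described above. The names on the right are illustrative labels.
CHECKED_SCHEMES = {
    "http": "http", "https": "http",
    "file": "file",
    "mailto": "mail",
    "ftp": "ftp",
    "telnet": "telnet",
    "news": "nntp", "snews": "nntp", "nntp": "nntp",
}

# A small sample of the recognized-but-ignored schemes (not the full list).
IGNORED_SCHEMES = {"javascript", "data", "irc", "rsync", "ldap"}


def classify(url):
    """Return (action, detail) for a URL based only on its scheme."""
    scheme = urlsplit(url).scheme.lower()
    if scheme in CHECKED_SCHEMES:
        return ("check", CHECKED_SCHEMES[scheme])
    if scheme in IGNORED_SCHEMES:
        # Ignored links produce a warning and are not checked further.
        return ("ignore", scheme)
    return ("unknown", scheme)


if __name__ == "__main__":
    for url in ("http://www.example.org/", "javascript:void(0)"):
        print(url, classify(url))
```

A table lookup keeps the scheme-to-check mapping in one place, so adding another ignored scheme is a one-line change.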
## Recursion

Before descending recursively into a URL, it has to fulfill several
conditions. The conditions are checked in this order:

1. The URL must be valid.
2. The URL must be parseable. This currently includes HTML files,
Opera bookmarks files, directories and, on Windows systems, MS Word
files if Word is installed on your system. If a file type cannot
be determined (for example it does not have a common HTML file
extension, and the content does not look like HTML), it is assumed
to be non-parseable.
3. The URL content must be retrievable. This is usually the case,
except for example for mailto: or unknown URL types.
4. The maximum recursion level must not be exceeded. It is configured
with the ``--recursion-level`` command line option, the recursion
level GUI option, or through the configuration file.
The recursion level is unlimited by default.
5. It must not match the ignored URL list. This is controlled with
the ``--ignore-url`` command line option or through the
configuration file.
6. The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by evaluating the server's
robots.txt file and searching for a "nofollow" directive in the
HTML header data.

Note that local and FTP directory recursion reads all files in the
directory, not just a subset like ``index.htm*``.
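
The ordered conditions above can be pictured as a chain of early-exit tests. The following is a minimal sketch, not LinkChecker's implementation: ``UrlInfo``, ``may_recurse`` and all field names are hypothetical, ``None`` stands in for the unlimited default, and treating the ignore list as regular expressions is an assumption of this sketch.

```python
import re
from dataclasses import dataclass


@dataclass
class UrlInfo:
    """Hypothetical summary of what is known about a checked URL."""
    url: str
    valid: bool
    parseable: bool
    retrievable: bool
    level: int          # recursion depth at which the URL was found
    robots_allow: bool  # outcome of the robots.txt / "nofollow" evaluation


def may_recurse(info, max_level=None, ignore_patterns=()):
    """Return (allowed, reason), testing the conditions in documented order."""
    if not info.valid:
        return False, "invalid URL"
    if not info.parseable:
        return False, "not parseable"
    if not info.retrievable:
        return False, "not retrievable"
    # max_level=None models the unlimited default (an assumption here).
    if max_level is not None and info.level > max_level:
        return False, "recursion level exceeded"
    if any(re.search(pattern, info.url) for pattern in ignore_patterns):
        return False, "ignored URL"
    if not info.robots_allow:
        return False, "disallowed by robots exclusion"
    return True, "ok"


if __name__ == "__main__":
    page = UrlInfo("http://www.example.org/a.html", True, True, True, 1, True)
    print(may_recurse(page))
    print(may_recurse(page, max_level=0))
```

Because the tests run in the listed order, the reported reason is always the first condition that fails, even when several would fail.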
## Configuration file

Each user can edit a configuration file with advanced options for
checking or filtering.

On Unix or OS X systems the user configuration file is at

- ``~/.linkchecker/linkcheckerrc``

On Windows the user configuration file is at

- ``%HOMEPATH%\.linkchecker\linkcheckerrc``
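
As a rough illustration of the two locations, a script could build the path like this. The helper ``user_config_path`` is hypothetical and not part of LinkChecker; the sketch assumes ``~`` corresponds to the ``HOME`` environment variable and that Windows is identified by a platform string starting with ``win``.

```python
import ntpath
import os
import posixpath
import sys


def user_config_path(platform=None, environ=None):
    """Build the per-user configuration file path for a platform.

    Both arguments default to the running system; they are parameters
    only so the function can be tried out for other platforms.
    """
    platform = sys.platform if platform is None else platform
    environ = os.environ if environ is None else environ
    if platform.startswith("win"):
        # %HOMEPATH%\.linkchecker\linkcheckerrc
        home = environ.get("HOMEPATH", "")
        return ntpath.join(home, ".linkchecker", "linkcheckerrc")
    # ~/.linkchecker/linkcheckerrc, with "~" taken from HOME if set
    home = environ.get("HOME", "~")
    return posixpath.join(home, ".linkchecker", "linkcheckerrc")


if __name__ == "__main__":
    print(user_config_path())
```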