mirror of
https://github.com/Hopiu/linkchecker.git
synced 2026-03-20 07:50:24 +00:00
Documentation
=============

Basic usage
-----------

To check a URL like ``http://www.myhomepage.org/`` it is enough to
execute ``linkchecker http://www.myhomepage.org/``. This will check the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are also checked for validity.

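
The distinction between links inside and outside the start domain can be
illustrated with a small Python sketch. The helper name ``is_internal`` and
the plain host name comparison are assumptions made for illustration; they
are not taken from LinkChecker's source.

```python
from urllib.parse import urlsplit


def is_internal(link: str, start_url: str) -> bool:
    """Return True if link points to the same host as start_url.

    Comparing host names is an assumed rule for this sketch only.
    """
    link_host = urlsplit(link).hostname
    start_host = urlsplit(start_url).hostname
    return link_host is not None and link_host == start_host
```

Under this assumption a link such as ``http://www.myhomepage.org/a.html``
would be followed recursively, while a link to another host would only be
checked for validity.
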

Performed checks
----------------

All URLs have to pass a preliminary syntax test. Minor quoting
mistakes issue a warning; all other syntax problems are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.

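
The two-level outcome of the syntax test (warning versus error) can be
sketched in Python as follows. The function name ``check_syntax``, the set
of characters considered safe, and the exact rules are assumptions chosen
for illustration; LinkChecker's real syntax test may differ.

```python
from urllib.parse import quote, urlsplit

# Characters left untouched when re-quoting; this set is an assumption.
SAFE_CHARS = "/:@?=&%#+,;~!$'()*[]"


def check_syntax(url: str) -> str:
    """Return 'error', 'warning' or 'ok' for a URL string (illustrative rules)."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return "error"      # for example an unbalanced IPv6 bracket
    if not parts.scheme:
        return "error"      # no scheme at all: treated as invalid here
    if quote(url, safe=SAFE_CHARS) != url:
        return "warning"    # only the quoting is off, e.g. a raw space
    return "ok"
```

In this sketch a URL containing an unquoted space yields a warning, while a
string without a scheme is rejected as an error.
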

- HTTP links (``http:``, ``https:``)

  After connecting to the given HTTP server, the given path
  or query is requested. All redirections are followed, and
  if a user and password are given they are used for authorization
  when necessary.
  Permanently moved pages issue a warning.
  All final HTTP status codes other than 2xx are errors.

- Local files (``file:``)

  A regular, readable file that can be opened is valid. A readable
  directory is also valid. All other files, for example device files,
  unreadable files or nonexistent files, are errors.

  File contents are checked for recursion.


- Mail links (``mailto:``)

  A mailto: link eventually resolves to a list of email addresses.
  If one address fails, the whole list will fail.
  For each mail address the following things are checked:

  1) Check the address syntax, both the part before and the part
     after the @ sign.
  2) Look up the MX DNS records. If no MX record is found,
     print an error.
  3) Check if one of the mail hosts accepts an SMTP connection.
     Check hosts with higher priority first.
     If no host accepts SMTP, a warning is printed.
  4) Try to verify the address with the VRFY command. If there is
     an answer, the verified address is printed as an info message.


- FTP links (``ftp:``)

  For FTP links the following is checked:

  1) connect to the specified host
  2) try to log in with the given user and password. The default
     user is ``anonymous``, the default password is ``anonymous@``.
  3) try to change to the given directory
  4) list the file with the NLST command

- Telnet links (``telnet:``)

  A connection to the given telnet server is attempted and, if
  user and password are given, a login is tried.

- NNTP links (``news:``, ``snews:``, ``nntp:``)

  A connection to the given NNTP server is attempted. If a news group or
  article is specified, it will be requested from the server.


- Ignored links (``javascript:``, etc.)

  An ignored link only produces a warning. No further checking
  is done.

  Here is the complete list of recognized but ignored link types. The
  most prominent of them are probably JavaScript links.

  - ``acap:`` (application configuration access protocol)
  - ``afs:`` (Andrew File System global file names)
  - ``chrome:`` (Mozilla specific)
  - ``cid:`` (content identifier)
  - ``clsid:`` (Microsoft specific)
  - ``data:`` (data)
  - ``dav:`` (dav)
  - ``fax:`` (fax)
  - ``find:`` (Mozilla specific)
  - ``gopher:`` (Gopher)
  - ``imap:`` (internet message access protocol)
  - ``irc:`` (internet relay chat)
  - ``isbn:`` (ISBN (international book numbers))
  - ``javascript:`` (JavaScript)
  - ``ldap:`` (Lightweight Directory Access Protocol)
  - ``mailserver:`` (Access to data available from mail servers)
  - ``mid:`` (message identifier)
  - ``mms:`` (multimedia stream)
  - ``modem:`` (modem)
  - ``nfs:`` (network file system protocol)
  - ``opaquelocktoken:`` (opaquelocktoken)
  - ``pop:`` (Post Office Protocol v3)
  - ``prospero:`` (Prospero Directory Service)
  - ``rsync:`` (rsync protocol)
  - ``rtsp:`` (real time streaming protocol)
  - ``service:`` (service location)
  - ``shttp:`` (secure HTTP)
  - ``sip:`` (session initiation protocol)
  - ``tel:`` (telephone)
  - ``tip:`` (Transaction Internet Protocol)
  - ``tn3270:`` (Interactive 3270 emulation sessions)
  - ``vemmi:`` (versatile multimedia interface)
  - ``wais:`` (Wide Area Information Servers)
  - ``z39.50r:`` (Z39.50 Retrieval)
  - ``z39.50s:`` (Z39.50 Session)

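
The HTTP rules from the list of check types above (follow redirections,
warn on permanently moved pages, treat a final status outside 2xx as an
error) can be sketched in Python as below. The function name
``classify_http``, the idea of passing in the already collected status
codes, and the choice of 301 and 308 as "permanently moved" are assumptions
for illustration only.

```python
from typing import List, Tuple

# Status codes treated as "permanently moved" in this sketch (assumption).
PERMANENT_REDIRECTS = {301, 308}


def classify_http(status_chain: List[int]) -> Tuple[str, List[str]]:
    """Classify the status codes seen while following redirections.

    status_chain holds one code per response; the last entry is the final
    response. Returns (result, warnings) where result is 'valid' or 'error'.
    """
    if not status_chain:
        return "error", []          # no response at all
    warnings = [
        f"permanently moved ({code})"
        for code in status_chain[:-1]
        if code in PERMANENT_REDIRECTS
    ]
    final = status_chain[-1]
    result = "valid" if 200 <= final < 300 else "error"
    return result, warnings
```

Keeping the classification separate from the network code makes the rule
easy to test without contacting a server.
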
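
The rule for local files given above (regular readable files and readable
directories are valid, everything else is an error) might look like this in
Python. The function name ``check_local_file`` is hypothetical, and the
sketch takes a file system path rather than a ``file:`` URL.

```python
import os
import stat


def check_local_file(path: str) -> str:
    """Return 'valid' or 'error' for a local path, following the rule above."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        return "error"                  # nonexistent or inaccessible
    if stat.S_ISDIR(mode):
        # A directory is valid if it can be read (listed).
        return "valid" if os.access(path, os.R_OK) else "error"
    if stat.S_ISREG(mode):
        try:
            with open(path, "rb"):
                return "valid"          # regular file that can be opened
        except OSError:
            return "error"              # unreadable
    return "error"                      # device files, sockets, FIFOs, ...
```
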
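
For mail links, step 1 (address syntax) and the rule that one failing
address fails the whole list can be sketched with the Python standard
library. The function names and the deliberately simple regular expressions
are assumptions for illustration. The MX lookup, the SMTP connection and
the VRFY command need network access and are left out of this sketch.

```python
import re
from email.utils import getaddresses
from urllib.parse import unquote, urlsplit

# Deliberately simple patterns; the real address grammar (RFC 5322) is wider.
LOCAL_RE = re.compile(r"^[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+$")
DOMAIN_RE = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def mailto_addresses(url: str) -> list:
    """Extract the address list from the path of a mailto: URL."""
    parts = urlsplit(url)
    if parts.scheme != "mailto":
        raise ValueError("not a mailto: URL")
    pairs = getaddresses([unquote(parts.path)])
    return [addr for _name, addr in pairs if addr]


def address_syntax_ok(address: str) -> bool:
    """Check the part before and the part after the @ sign."""
    local, sep, domain = address.rpartition("@")
    return bool(sep) and bool(LOCAL_RE.match(local)) and bool(DOMAIN_RE.match(domain))


def check_mailto_syntax(url: str) -> bool:
    """The whole list fails if any single address fails."""
    addresses = mailto_addresses(url)
    return bool(addresses) and all(address_syntax_ok(a) for a in addresses)
```
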
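
The four FTP steps listed above map closely onto Python's ``ftplib``
module. The sketch below is one possible arrangement, not LinkChecker's
implementation: the function name ``check_ftp``, the ``ftp_factory``
parameter (added so the steps can be exercised without a real server) and
the choice to treat the URL path as relative to the login directory are
all assumptions.

```python
import ftplib
import posixpath
from urllib.parse import unquote, urlsplit


def check_ftp(url: str, ftp_factory=ftplib.FTP, timeout: float = 30.0) -> list:
    """Run the four FTP steps and return the NLST listing.

    Raises ValueError for a non-FTP URL; errors from the FTP object
    (for ftplib these are the ftplib.all_errors types) propagate.
    """
    parts = urlsplit(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError("not an ftp: URL with a host")
    user = unquote(parts.username) if parts.username else "anonymous"
    password = unquote(parts.password) if parts.password else "anonymous@"
    dirname, filename = posixpath.split(unquote(parts.path))
    dirname = dirname.strip("/")    # assumed: relative to the login directory

    ftp = ftp_factory()
    try:
        ftp.connect(parts.hostname, parts.port or 21, timeout)   # step 1
        ftp.login(user, password)                                 # step 2
        if dirname:
            ftp.cwd(dirname)                                      # step 3
        return ftp.nlst(filename) if filename else ftp.nlst()     # step 4
    finally:
        ftp.close()
```

Called with the default factory, the function opens a real connection, so
it should only be pointed at servers you are allowed to probe.
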

Recursion
---------

Before LinkChecker descends recursively into a URL, the URL has to
fulfill several conditions. They are checked in this order:

1. A URL must be valid.

2. A URL must be parseable. This currently includes HTML files,
   Opera bookmark files, and directories. If a file type cannot
   be determined (for example it does not have a common HTML file
   extension, and the content does not look like HTML), it is assumed
   to be non-parseable.

3. The URL content must be retrievable. This is usually the case,
   except for example for mailto: or unknown URL types.

4. The maximum recursion level must not be exceeded. It is configured
   with the ``--recursion-level`` option and is unlimited by default.

5. It must not match the ignored URL list. This is controlled with
   the ``--ignore-url`` option.

6. The Robots Exclusion Protocol must allow links in the URL to be
   followed recursively. This is checked by searching for a
   "nofollow" directive in the HTML header data.

Note that the directory recursion reads all files in that
directory, not just a subset like ``index.htm*``.

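
The ordered recursion conditions can be expressed as a short-circuiting
chain. In the Python sketch below, the ``UrlState`` class, its field names
and the ``may_recurse`` function are hypothetical, and reading "must not be
exceeded" as ``level < max_level`` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UrlState:
    """Facts about one checked URL; all field names are hypothetical."""
    valid: bool
    parseable: bool
    retrievable: bool
    level: int          # current recursion depth
    ignored: bool       # matched the ignored URL list
    nofollow: bool      # "nofollow" found in the HTML header data


def may_recurse(state: UrlState, max_level: Optional[int] = None) -> bool:
    """Apply the six conditions in the documented order.

    max_level=None stands for the unlimited default.
    """
    conditions = (
        lambda: state.valid,
        lambda: state.parseable,
        lambda: state.retrievable,
        lambda: max_level is None or state.level < max_level,
        lambda: not state.ignored,
        lambda: not state.nofollow,
    )
    # all() stops at the first failing condition, so later ones are skipped.
    return all(check() for check in conditions)
```
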
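
The "nofollow" search mentioned in condition 6 could be done with Python's
``html.parser`` module. The sketch assumes that the directive appears in a
``<meta name="robots" content="...">`` tag; the text above only says "HTML
header data", so this reading, like the names ``RobotsMetaParser`` and
``has_nofollow``, is an assumption.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for item in attributes.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_nofollow(html_text: str) -> bool:
    """Return True if a robots meta tag contains the nofollow directive."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "nofollow" in parser.directives
```
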