mirror of
https://github.com/Hopiu/linkchecker.git
synced 2026-03-23 09:20:30 +00:00
175 lines
6.5 KiB
Text
175 lines
6.5 KiB
Text
# Documentation
|
|
|
|
## Basic usage
|
|
|
|
To check a URL like [http://www.example.org/](http://www.example.org/)
|
|
it is enough to
|
|
type ``linkchecker www.example.org`` on the command line or
|
|
type ``www.example.org`` in the GUI application. This will check the
|
|
complete domain of ``http://www.example.org`` recursively. All links
|
|
pointing outside of the domain are also checked for validity.
|
|
|
|
Local files can also be checked. On Unix or OSX systems the syntax
|
|
is ``file:///path/to/my/file.html``. On Windows the syntax is
|
|
``file://C|/path/to/my/file.html``. When directories are checked,
|
|
all included files will be checked.
|
|
|
|
On the GUI client the ``Edit`` menu has shortcuts for bookmark
|
|
files. For example if Google Chrome is installed, there will be
|
|
a menu entry called ``Insert Google Chrome bookmark file`` which
|
|
can be used to check all browser bookmarks.
|
|
|
|
## Options
|
|
|
|
The commandline client options are documented
|
|
in the [linkchecker(1) manual page](http://wummel.github.io/linkchecker/man1/linkchecker.1.html).
|
|
|
|
In the GUI client, the following options are available:
|
|
|
|
- Recursive depth
|
|
|
|
Check recursively all links up to given depth.
|
|
A negative depth (eg. ``-1``) will enable infinite recursion.
|
|
|
|
- Verbose output
|
|
|
|
If set, log all checked URLs. Default is to log only errors and warnings.
|
|
|
|
- Debug
|
|
|
|
Prints debugging output in a separate window which can be seen with
|
|
``Help -> Show debug``.
|
|
|
|
- Debug memory usage
|
|
|
|
Profiles memory usage and writes statistics and a dump file when checking
|
|
stops. The dump file can be examined with
|
|
[external tools](https://github.com/wummel/linkchecker/blob/master/tests/analyze_memdump.py).
|
|
This option should only be useful for developers.
|
|
|
|
- Warning strings
|
|
|
|
Log a warning if any strings are found in the content of the checked
|
|
URL. Strings are entered one per line.
|
|
|
|
Use this to check for pages that contain some form of error, for example
|
|
"``This page has moved``" or "``Oracle Application error``".
|
|
|
|
- Ignoring URLs
|
|
|
|
URLs matching the given regular expressions will be ignored and not checked.
|
|
Useful if certain URL types should not be checked like emails (ie.
|
|
"``^mailto:``").
|
|
|
|
## Configuration file
|
|
|
|
Each user can edit a configuration with advanced options for
|
|
checking or filtering. The
|
|
[linkcheckerrc(5) manual page](http://wummel.github.io/linkchecker/man5/linkcheckerrc.5.html)
|
|
documents all the options.
|
|
|
|
In the GUI client the configuration file can be edited directly from
|
|
the dialog ``Edit -> Options`` and the clicking on ``Edit``.
|
|
|
|
## Performed checks
|
|
|
|
All URLs have to pass a preliminary syntax test.
|
|
After the syntax check passes, the URL is queued for connection
|
|
checking. All connection check types are described below.
|
|
|
|
- HTTP links (``http:``, ``https:``)
|
|
|
|
After connecting to the given HTTP server the given path
|
|
or query is requested. All redirections are followed, and
|
|
if user/password is given it will be used as authorization
|
|
when necessary.
|
|
Permanently moved pages (status code 301) issue a warning.
|
|
All final HTTP status codes other than 2xx are errors.
|
|
|
|
For HTTPS links, the SSL certificate is checked against the
|
|
given hostname. If it does not match, a warnings is printed.
|
|
|
|
- Local files (``file:``)
|
|
|
|
A regular, readable file that can be opened is valid. A readable
|
|
directory is also valid. All other files, for example unreadable,
|
|
non-existing or device files are errors.
|
|
|
|
File contents are checked for recursion. If they are parseable
|
|
files (for example HTML files), all links in that file will be
|
|
checked.
|
|
|
|
- Mail links (``mailto:``)
|
|
|
|
A mailto: link resolves to a list of email addresses.
|
|
If one email address fails the whole list will fail.
|
|
For each email address the following things are checked:
|
|
|
|
1. Check the address syntax, both the part before and after
|
|
the @ sign.
|
|
2. Look up the MX DNS records. If no MX record is found,
|
|
print an error.
|
|
3. Check if one of the MX mail hosts accept an SMTP connection.
|
|
Check hosts with higher priority first.
|
|
If none of the hosts accept SMTP, a warning is printed.
|
|
4. Try to verify the address with the VRFY command. If there is
|
|
an answer, the verified address is printed as an info.
|
|
|
|
- FTP links (``ftp:``)
|
|
|
|
For FTP links the following is checked:
|
|
|
|
1. Connect to the specified host.
|
|
2. Try to login with the given user and password. The default
|
|
user is ``anonymous``, the default password is ``anonymous@``.
|
|
3. Try to change to the given directory.
|
|
4. List the file with the NLST command.
|
|
|
|
- Telnet links (``telnet:``)
|
|
|
|
A connect and if user/password are given, login to the
|
|
given telnet server is tried.
|
|
|
|
- NNTP links (``news:``, ``snews:``, ``nntp``)
|
|
|
|
A connect is tried to connect to the given NNTP server. If a news group or
|
|
article is specified, it will be requested from the server.
|
|
|
|
- Unsupported links (``javascript:``, etc.)
|
|
|
|
An unsupported link will print a warning, but no error. No further checking
|
|
will be made.
|
|
|
|
The complete list of recognized, but unsupported links can be seen in the
|
|
[unknownurl.py](https://github.com/wummel/linkchecker/blob/master/linkcheck/checker/unknownurl.py)
|
|
source file. The most prominent of them are JavaScript links.
|
|
|
|
## Recursion
|
|
|
|
Before descending recursively into a URL, it has to fulfill several
|
|
conditions. The conditions are checked in this order:
|
|
|
|
1. The URL must be valid.
|
|
2. The URL must be parseable. This currently includes HTML files,
|
|
Bookmarks files (Opera, Chrome or Safari), directories and on
|
|
Windows systems MS Word files if Word and the Pywin32 module
|
|
is installed on your system.
|
|
If a file type cannot be determined (for example it does not have
|
|
a common HTML file extension, and the content does not look like
|
|
HTML), it is assumed to be non-parseable.
|
|
3. The URL content must be retrievable. This is usually the case
|
|
except for example mailto: or unknown URL types.
|
|
4. The maximum recursion level must not be exceeded. It is configured
|
|
with the ``--recursion-level`` command line option, the recursion
|
|
level GUI option, or through the configuration file.
|
|
The recursion level is unlimited by default.
|
|
5. It must not match the ignored URL list. This is controlled with
|
|
the ``--ignore-url`` command line option or through the
|
|
configuration file.
|
|
6. The Robots Exclusion Protocol must allow links in the URL to be
|
|
followed recursively. This is checked by evaluating the servers
|
|
robots.txt file and searching for a "nofollow" directive in the
|
|
HTML header data.
|
|
|
|
Note that the local and FTP directory recursion reads all files in that
|
|
directory, not just a subset like ``index.htm*``.
|