.. mirror of https://github.com/Hopiu/linkchecker.git, synced 2026-03-19 07:20:26 +00:00
Documentation
=============

Basic usage
-----------
To check a URL like ``http://www.myhomepage.org/`` it is enough to
execute ``linkchecker http://www.myhomepage.org/``. This will check the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are also checked for validity.

Performed checks
----------------
All URLs have to pass a preliminary syntax test. Minor quoting
mistakes issue a warning; all other invalid syntax issues are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.
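
The split between warnings and errors in the preliminary test can be
sketched with Python's standard ``urllib.parse`` module. This is an
illustrative sketch only; ``preliminary_check`` and its two rules are
simplified stand-ins, not LinkChecker's actual implementation:

```python
from urllib.parse import urlsplit

def preliminary_check(url):
    """Rough sketch of a preliminary URL syntax test.

    Returns "error", "warning", or "ok".  Illustration only,
    not LinkChecker's real checking code.
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        # Completely malformed URLs (e.g. bad IPv6 brackets) are errors.
        return "error"
    if not parts.scheme:
        # A URL without a scheme cannot be checked.
        return "error"
    if " " in url:
        # An unquoted space is a minor quoting mistake -> warning.
        return "warning"
    return "ok"

print(preliminary_check("http://www.myhomepage.org/"))  # ok
print(preliminary_check("http://example.com/a b"))      # warning
```

A URL that parses but contains a minor quoting mistake is still
queued for connection checking; only the warning is recorded.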
- HTTP links (``http:``, ``https:``)

  After connecting to the given HTTP server, the given path
  or query is requested. All redirections are followed, and
  if a user/password is given, it will be used as authorization
  when necessary.
  Permanently moved pages issue a warning.
  All final HTTP status codes other than 2xx are errors.
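
The resulting classification can be sketched as a small helper.
``classify_http_result`` is a hypothetical function, assuming redirects
have already been followed and any permanent move was recorded along
the way; it is not LinkChecker's actual code:

```python
def classify_http_result(final_status, permanently_moved=False):
    """Classify a checked HTTP link as described above.

    final_status: the HTTP status code after following all redirects.
    permanently_moved: True if a 301 redirect was seen on the way.
    Illustrative sketch only.
    """
    if 200 <= final_status < 300:
        # 2xx is valid, but a permanent move still deserves a warning.
        return "warning" if permanently_moved else "valid"
    # Any other final status code is an error.
    return "error"

print(classify_http_result(200))                          # valid
print(classify_http_result(200, permanently_moved=True))  # warning
print(classify_http_result(404))                          # error
```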
- Local files (``file:``)

  A regular, readable file that can be opened is valid. A readable
  directory is also valid. All other files, for example device files,
  unreadable or non-existing files, are errors.

  File contents are checked for recursion.
- Mail links (``mailto:``)

  A mailto: link eventually resolves to a list of email addresses.
  If one address fails, the whole list fails.
  For each mail address we check the following things:

  1) Check the address syntax, both of the part before and after
     the @ sign.
  2) Look up the MX DNS records. If no MX record is found,
     print an error.
  3) Check if one of the mail hosts accepts an SMTP connection.
     Hosts with higher priority are checked first.
     If no host accepts SMTP, print a warning.
  4) Try to verify the address with the VRFY command. If an
     answer is received, print the verified address as info.
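
Step 1 can be sketched in Python; steps 2 to 4 require DNS and SMTP
network access and are omitted here. The patterns below are
deliberately simplified stand-ins, far stricter than the full RFC 5322
address grammar:

```python
import re

# Simplified patterns for the local part and the domain; the real
# address grammar is considerably more permissive.
_LOCAL = re.compile(r"^[A-Za-z0-9._%+-]+$")
_DOMAIN = re.compile(r"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def check_address_syntax(addr):
    """Step 1 above: validate the parts before and after the @ sign.

    A rough sketch; the MX lookup and SMTP probing are not shown.
    """
    if addr.count("@") != 1:
        return False
    local, domain = addr.split("@")
    return bool(_LOCAL.match(local)) and bool(_DOMAIN.match(domain))

print(check_address_syntax("john@company.com"))   # True
print(check_address_syntax("john@@company.com"))  # False
```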
- FTP links (``ftp:``)

  For FTP links we:

  1) connect to the specified host
  2) try to log in with the given user and password. The default
     user is ``anonymous``, the default password is ``anonymous@``.
  3) try to change to the given directory
  4) list the file with the NLST command
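
The four steps can be pictured as a dry run that only builds the plan
of FTP operations. ``ftp_check_plan`` is a hypothetical helper; a real
check would issue these commands over the wire, for example via
Python's ``ftplib``:

```python
from urllib.parse import urlsplit
import posixpath

def ftp_check_plan(url, user="anonymous", password="anonymous@"):
    """Return the sequence of FTP operations for checking *url*.

    A dry-run sketch of the four steps above; it builds the plan
    without opening a network connection.
    """
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    plan = [
        ("connect", parts.hostname),      # step 1
        ("login", user, password),        # step 2
    ]
    if directory and directory != "/":
        plan.append(("cwd", directory))   # step 3
    if filename:
        plan.append(("nlst", filename))   # step 4
    return plan

print(ftp_check_plan("ftp://ftp.example.com/pub/file.txt"))
```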
- Telnet links (``telnet:``)

  We try to connect and, if a user/password is given, log in to the
  given telnet server.
- NNTP links (``news:``, ``snews:``, ``nntp:``)

  We try to connect to the given NNTP server. If a news group or
  article is specified, we try to request it from the server.
- Ignored links (``javascript:``, etc.)

  An ignored link only prints a warning. No further checking
  is done.

Here is a complete list of recognized, but ignored link types. The most
prominent of them are JavaScript links.
- ``acap:`` (application configuration access protocol)
- ``afs:`` (Andrew File System global file names)
- ``chrome:`` (Mozilla specific)
- ``cid:`` (content identifier)
- ``clsid:`` (Microsoft specific)
- ``data:`` (data)
- ``dav:`` (dav)
- ``fax:`` (fax)
- ``find:`` (Mozilla specific)
- ``gopher:`` (Gopher)
- ``imap:`` (internet message access protocol)
- ``isbn:`` (international book numbers)
- ``javascript:`` (JavaScript)
- ``ldap:`` (Lightweight Directory Access Protocol)
- ``mailserver:`` (access to data available from mail servers)
- ``mid:`` (message identifier)
- ``mms:`` (multimedia stream)
- ``modem:`` (modem)
- ``nfs:`` (network file system protocol)
- ``opaquelocktoken:`` (opaquelocktoken)
- ``pop:`` (Post Office Protocol v3)
- ``prospero:`` (Prospero Directory Service)
- ``rsync:`` (rsync protocol)
- ``rtsp:`` (real time streaming protocol)
- ``service:`` (service location)
- ``shttp:`` (secure HTTP)
- ``sip:`` (session initiation protocol)
- ``tel:`` (telephone)
- ``tip:`` (Transaction Internet Protocol)
- ``tn3270:`` (interactive 3270 emulation sessions)
- ``vemmi:`` (versatile multimedia interface)
- ``wais:`` (Wide Area Information Servers)
- ``z39.50r:`` (Z39.50 Retrieval)
- ``z39.50s:`` (Z39.50 Session)
Recursion
---------

Before descending recursively into a URL, it has to fulfill several
conditions. They are checked in this order:
1. A URL must be valid.

2. A URL must be parseable. This currently includes HTML files,
   Opera bookmarks files, and directories. If a file type cannot
   be determined (for example it does not have a common HTML file
   extension, and the content does not look like HTML), it is assumed
   to be non-parseable.

3. The URL content must be retrievable. This is usually the case
   except for example mailto: or unknown URL types.

4. The maximum recursion level must not be exceeded. It is configured
   with the ``--recursion-level`` option and is unlimited by default.

5. It must not match the ignored URL list. This is controlled with
   the ``--ignore-url`` option.

6. The Robots Exclusion Protocol must allow links in the URL to be
   followed recursively. This is checked by searching for a
   "nofollow" directive in the HTML header data.

Note that the directory recursion reads all files in that
directory, not just a subset like ``index.htm*``.
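
The ordered, short-circuiting nature of these conditions can be
sketched as follows. The predicate names and the ``url`` dictionary
are hypothetical stand-ins for the six checks above, not LinkChecker
internals:

```python
def should_recurse(url_data, conditions):
    """Apply the recursion conditions in order; stop at the first failure.

    Returns (True, None) if all conditions pass, otherwise
    (False, name_of_failed_condition).  Illustrative sketch only.
    """
    for name, predicate in conditions:
        if not predicate(url_data):
            return False, name
    return True, None

# Hypothetical stand-in predicates for the six conditions:
conditions = [
    ("valid",        lambda u: u["valid"]),
    ("parseable",    lambda u: u["content_type"] in
                               ("html", "directory", "opera-bookmarks")),
    ("retrievable",  lambda u: u["retrievable"]),
    ("max-level",    lambda u: u["level"] <= u["max_level"]),
    ("not-ignored",  lambda u: not u["ignored"]),
    ("robots-allow", lambda u: not u["nofollow"]),
]

url = dict(valid=True, content_type="html", retrievable=True,
           level=2, max_level=5, ignored=False, nofollow=False)
print(should_recurse(url, conditions))   # (True, None)
```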
Frequently asked questions
--------------------------
**Q: LinkChecker produced an error, but my web page is ok with
Mozilla/IE/Opera/... Is this a bug in LinkChecker?**

A: Please check your web pages first. Are they really ok?
Use the ``--check-html`` option, or check if you are using a proxy
which produces the error.
**Q: I still get an error, but the page is definitely ok.**

A: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy of the webmaster running the website you are checking. Look at
the ``/robots.txt`` file, which follows the
`robots.txt exclusion standard`_.

.. _`robots.txt exclusion standard`:
   http://www.robotstxt.org/wc/norobots-rfc.html
**Q: How can I tell LinkChecker which proxy to use?**

A: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy, ftp_proxy environment
variables to a URL that identifies the proxy server before starting
LinkChecker. For example

::

   $ http_proxy="http://www.someproxy.com:3128"
   $ export http_proxy
**Q: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.**

A: You have to quote special characters (e.g. spaces) in the subject
field. The correct link should be "mailto:...?subject=Hello%20John".
Unfortunately browsers like IE and Netscape do not enforce this.
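
The required quoting can be produced with Python's standard library:

```python
from urllib.parse import quote

subject = "Hello John"
# Spaces and other special characters must be percent-quoted
# in a mailto: query field.
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)  # mailto:john@company.com?subject=Hello%20John
```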
**Q: Does LinkChecker have JavaScript support?**

A: No, and it never will. If your page does not work without
JavaScript, it is better checked with a browser testing tool like
Selenium_.

.. _Selenium:
   http://seleniumhq.org/
**Q: Is LinkChecker's cookie feature insecure?**

A: Cookies cannot store more information than is in the HTTP request
itself, so you are not giving away any additional system information.
After being stored, however, cookies are sent back to the server on
request. Not to every server, but only to the one the cookie
originated from! This could be used to "track" subsequent requests to
this server, and this is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost, so the tracking is restricted to the checking time.
The cookie feature is disabled by default.
**Q: I want to have my own logging class. How can I use it in LinkChecker?**

A: Currently, only a Python API lets you define new logging classes.
Define your own logging class as a subclass of StandardLogger or any
other logging class in the log module. Then register it with the
configuration and append a new logger instance to the file output.

::

   import linkcheck, MyLogger
   log_format = 'mylog'
   log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
   cfg = linkcheck.configuration.Configuration()
   cfg.logger_add(log_format, MyLogger.MyLogger)
   cfg['fileoutput'].append(cfg.logger_new(log_format, log_args))
**Q: LinkChecker does not ignore anchor references when caching.**

**Q: Some links with anchors are getting checked twice.**

A: This is not a bug.
It is not necessarily true that if a URL ``ABC#anchor1`` works, then
``ABC#anchor2`` works too. That is not specified anywhere, and there
are server-side scripts that fail on some anchors and not on others.
This is the reason for always checking URLs with different anchors.
If you really want to disable this, use the ``--no-anchor-caching``
option.
**Q: I see LinkChecker gets a /robots.txt file for every site it
checks. What is that about?**

A: LinkChecker follows the `robots.txt exclusion standard`_. To avoid
misuse of LinkChecker, you cannot turn this feature off.
See the `Web Robot pages`_ and the `Spidering report`_ for more info.

.. _`Web Robot pages`:
   http://www.robotstxt.org/wc/robots.html
.. _`Spidering report`:
   http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
**Q: How do I print unreachable/dead documents of my website with
LinkChecker?**

A: No can do. This would require file system access to your web
repository and access to your web server configuration.
**Q: How do I check HTML/XML/CSS syntax with LinkChecker?**

A: Use the ``--check-html`` and ``--check-css`` options.