.. meta::
   :navigation.order: 3
   :navigation.name: Documentation

Documentation
=============

.. contents::


Basic usage
-----------

To check a URL like ``http://www.myhomepage.org/`` it is enough to
execute ``linkchecker http://www.myhomepage.org/``. This will check the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are also checked for validity.

For more options, read the man page ``linkchecker(1)`` or execute
``linkchecker -h``.
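
For example, a minimal run and a run with a limited crawl depth could look
like this (the host name is only a placeholder; the --recursion-level option
is described in the FAQ below)::

    $ linkchecker http://www.myhomepage.org/
    $ linkchecker --recursion-level=2 http://www.myhomepage.org/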


Performed checks
----------------

All URLs have to pass a preliminary syntax test. Minor quoting
mistakes issue a warning; all other invalid syntax issues are errors.

After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.

- HTTP links (``http:``, ``https:``)

  After connecting to the given HTTP server, the given path
  or query is requested. All redirections are followed, and
  if a user name and password are given, they are used for
  authorization when necessary.
  Permanently moved pages issue a warning.
  All final HTTP status codes other than 2xx are errors.

- Local files (``file:``)

  A regular, readable file that can be opened is valid. A readable
  directory is also valid. All other files, for example device files,
  unreadable or non-existent files, are errors.

  File contents are checked for recursion.

- Mail links (``mailto:``)

  A mailto: link eventually resolves to a list of email addresses.
  If one address fails, the whole list will fail.
  For each mail address we check the following things:

  1) Look up the MX DNS records. If no MX record is found,
     print an error.
  2) Check if one of the mail hosts accepts an SMTP connection.
     Hosts with higher priority are checked first.
     If no host accepts SMTP, we print a warning.
  3) Try to verify the address with the VRFY command. If we get
     an answer, print the verified address as an info.
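
  As a rough illustration of step 1, the same MX lookup can be reproduced
  manually with a standard DNS tool (``company.com`` is just the example
  domain used later in this document)::

      $ host -t mx company.com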

- FTP links (``ftp:``)

  For FTP links we do the following:

  1) connect to the specified host
  2) try to log in with the given user and password. The default
     user is ``anonymous``, the default password is ``anonymous@``.
  3) try to change to the given directory
  4) list the file with the NLST command
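
  For example, an FTP URL can be checked directly, and credentials other
  than the anonymous defaults can be embedded in the URL itself (host, path
  and credentials are only placeholders; this is a sketch, assuming your
  version accepts credentials embedded in the URL)::

      $ linkchecker ftp://ftp.somehost.com/pub/
      $ linkchecker ftp://user:password@ftp.somehost.com/pub/file.txt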

- Gopher links (``gopher:``)

  We try to send the given selector (or query) to the gopher server.

- Telnet links (``telnet:``)

  We try to connect and, if user and password are given, log in to the
  given telnet server.

- NNTP links (``news:``, ``snews:``, ``nntp:``)

  We try to connect to the given NNTP server. If a news group or
  article is specified, we try to request it from the server.

- Ignored links (``javascript:``, etc.)

  An ignored link will only print a warning. No further checking
  will be done.

  Here is a complete list of recognized but ignored link types. The most
  prominent of them are JavaScript links.

  - ``acap:`` (application configuration access protocol)
  - ``afs:`` (Andrew File System global file names)
  - ``chrome:`` (Mozilla specific)
  - ``cid:`` (content identifier)
  - ``clsid:`` (Microsoft specific)
  - ``data:`` (data)
  - ``dav:`` (WebDAV)
  - ``fax:`` (fax)
  - ``find:`` (Mozilla specific)
  - ``imap:`` (internet message access protocol)
  - ``isbn:`` (international book numbers)
  - ``javascript:`` (JavaScript)
  - ``ldap:`` (Lightweight Directory Access Protocol)
  - ``mailserver:`` (access to data available from mail servers)
  - ``mid:`` (message identifier)
  - ``mms:`` (multimedia stream)
  - ``modem:`` (modem)
  - ``nfs:`` (network file system protocol)
  - ``opaquelocktoken:`` (WebDAV lock token)
  - ``pop:`` (Post Office Protocol v3)
  - ``prospero:`` (Prospero Directory Service)
  - ``rsync:`` (rsync protocol)
  - ``rtsp:`` (real time streaming protocol)
  - ``service:`` (service location)
  - ``shttp:`` (secure HTTP)
  - ``sip:`` (session initiation protocol)
  - ``tel:`` (telephone)
  - ``tip:`` (Transaction Internet Protocol)
  - ``tn3270:`` (interactive 3270 emulation sessions)
  - ``vemmi:`` (versatile multimedia interface)
  - ``wais:`` (Wide Area Information Servers)
  - ``z39.50r:`` (Z39.50 Retrieval)
  - ``z39.50s:`` (Z39.50 Session)


Recursion
---------

Recursion occurs on HTML files, Opera bookmark files and directories.
Note that the directory recursion reads all files in that
directory, not just a subset like ``index.htm*``.


.. meta::
   :navigation.order: 4
   :navigation.name: FAQ


Frequently asked questions
--------------------------

**Q: LinkChecker produced an error, but my web page is OK with
Netscape/IE/Opera/...
Is this a bug in LinkChecker?**

A: Please check your web pages first. Are they really OK? Use
a `syntax highlighting editor`_. Use `HTML Tidy`_.
Check whether you are using a proxy that produces the error.

.. _`syntax highlighting editor`:
   http://fte.sourceforge.net/
.. _`HTML Tidy`:
   http://tidy.sourceforge.net/


**Q: I still get an error, but the page is definitely OK.**

A: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy of the webmaster running the website you are checking.
It might even be possible for a website to send robots different
web pages than it sends to normal browsers.


**Q: How can I tell LinkChecker which proxy to use?**

A: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy
environment variables to a URL that identifies the proxy server before
starting LinkChecker. For example

::

    $ http_proxy="http://www.someproxy.com:3128"
    $ export http_proxy
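
On Windows, the same variable can be set in the command prompt before
starting LinkChecker (the proxy URL is the same placeholder as above)::

    > set http_proxy=http://www.someproxy.com:3128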

In a Macintosh environment, LinkChecker will retrieve proxy information
from Internet Config.


**Q: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.**

A: You have to quote special characters (e.g. spaces) in the subject field.
The correct link should be "mailto:...?subject=Hello%20John".
Unfortunately, browsers like IE and Netscape do not enforce this.
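
Side by side, the two forms from this question look like this::

    mailto:john@company.com?subject=Hello John     (reported as an error)
    mailto:john@company.com?subject=Hello%20John   (correct)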


**Q: Does LinkChecker have JavaScript support?**

A: No, and it never will. If your page does not work without JS, then your
web design is broken.
Use PHP or Zope or ASP for dynamic content, and use JavaScript just as
an add-on for your web pages.


**Q: I don't get this --extern/--intern stuff.**

A: When it comes to checking, there are three types of URLs. Note
that local files are also represented as URLs (i.e. ``file://``), so
local files can be external URLs.

1) strict external URLs:
   We only do syntax checking. Internal URLs are never strict.

2) external URLs:
   Like 1), but we additionally check if they are valid by connect()ing
   to them.

3) internal URLs:
   Like 2), but we additionally check if they are HTML pages and if so,
   we descend recursively into this link and check all the links in the
   HTML content.
   The --recursion-level option restricts the number of such recursive
   descents.

LinkChecker provides four options that determine which of those three
categories a URL falls into: --intern, --extern, --extern-strict-all and
--denyallow.
By default all URLs are internal. With --extern you specify which URLs
are external. With --intern you specify which URLs are internal.
Now imagine you have both --extern and --intern. What happens
when a URL matches both patterns? Or when it matches none? In this
situation the --denyallow option specifies the order in which we match
the URL. By default it is internal/external; with --denyallow the order is
external/internal. Either way, the first match counts, and if none matches,
the last checked category is the category for the URL.
Finally, with --extern-strict-all all external URLs are strict.

Oh, and just to boggle your mind: you can have more than one external
regular expression in a config file, and for each of those expressions
you can specify whether the matched external URLs should be strict or not.

An example. We don't want to check mailto URLs. Then it is
``-i'!^mailto:'``. The ``!`` negates an expression. With --extern-strict-all,
we don't even connect to any mail hosts.

Another example. We check our site www.mycompany.com, don't recurse
into external links pointing outside of our site, and want to ignore links
to hollowood.com and hullabulla.com completely.
This can only be done with a configuration entry like

::

    [filtering]
    extern1=hollowood.com 1
    extern2=hullabulla.com 1
    # the 1 means strict external, i.e. don't even connect

and the command
``linkchecker --intern=www.mycompany.com www.mycompany.com``.


**Q: Is LinkChecker's cookie feature insecure?**

A: Cookies cannot store more information than is in the HTTP request itself,
so you are not giving away any more system information.
Once stored, however, the cookies are sent back to the server on request.
Not to every server, but only to the one the cookie originated from!
This could be used to "track" subsequent requests to this server,
and this is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost, so the tracking is restricted to the checking time.
The cookie feature is disabled by default.


**Q: I want to have my own logging class. How can I use it in LinkChecker?**

A: Currently, only the Python API lets you define new logging classes.
Define your own logging class as a subclass of StandardLogger or any other
logging class in the log module.
Then register your new logger class with the configuration object and
append a new logger instance to the file output, as in the example below.

::

    import linkcheck.configuration
    import MyLogger

    log_format = 'mylog'
    log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
    cfg = linkcheck.configuration.Configuration()
    # register the new logger class under the chosen format name
    cfg.logger_add(log_format, MyLogger.MyLogger)
    # add a logger instance writing to foo.txt to the file output list
    cfg['fileoutput'].append(cfg.logger_new(log_format, log_args))


**Q: LinkChecker does not ignore anchor references when caching.**

**Q: Some links with anchors are getting checked twice.**

A: This is not a bug.
It is common practice to believe that if a URL ``ABC#anchor1`` works, then
``ABC#anchor2`` works too. That is not specified anywhere, and I have seen
server-side scripts that fail on some anchors and not on others.
This is the reason for always checking URLs with different anchors.
If you really want to disable this, use the ``--no-anchor-caching``
option.
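
For example (the URL is only a placeholder)::

    $ linkchecker --no-anchor-caching http://www.myhomepage.org/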


**Q: I see LinkChecker gets a /robots.txt file for every site it
checks. What is that about?**

A: LinkChecker follows the robots.txt exclusion standard. To avoid
misuse of LinkChecker, you cannot turn this feature off.
See the `Web Robot pages`_ and the `Spidering report`_ for more info.
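
As an illustration, the exclusion standard lets a site block all robots,
including LinkChecker, from parts of the site with an entry like this
(``/private/`` is only a placeholder path)::

    User-agent: *
    Disallow: /private/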

.. _`Web Robot pages`:
   http://www.robotstxt.org/wc/robots.html
.. _`Spidering report`:
   http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt


**Q: Ctrl-C does not stop LinkChecker immediately. Why is that so?**

A: The Python interpreter has to wait for all threads to finish, and
this means waiting for all open sockets to close. The default timeout
for sockets is 30 seconds, hence the delay.
You can change the default socket timeout with the --timeout option.
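
For example, something like the following lowers the socket timeout to 10
seconds (the URL is only a placeholder)::

    $ linkchecker --timeout=10 http://www.myhomepage.org/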


**Q: How do I print unreachable/dead documents of my website with
LinkChecker?**

A: No can do. This would require file system access to your web
repository and access to your web server configuration.

You can instead store the LinkChecker results in a database
and look for missing files.


**Q: How do I check HTML/XML syntax with LinkChecker?**

A: No can do. Use the `HTML Tidy`_ program.

.. _`HTML Tidy`:
   http://tidy.sourceforge.net/