LinkChecker
===========

With LinkChecker you can check your HTML documents for broken links.
Features
--------
o recursive checking
o multithreaded
o output can be colored or normal text, HTML, SQL, CSV or a GML sitemap
  graph
o HTTP/1.1, HTTPS, FTP, mailto:, news:, Gopher, Telnet and local file links
  are supported
  JavaScript links are currently ignored
o restrict link checking with regular expression filters for URLs
o HTTP proxy support
o give username/password for HTTP and FTP authorization
o robots.txt exclusion protocol support
o internationalization support (currently English and German)
License
-------
LinkChecker is licensed under the GNU General Public License.
Credits go to Guido van Rossum for making Python. His hovercraft is
full of eels!
As this program is directly derived from my Java link checker, additional
credits go to Robert Forsman (the author of JCheckLinks) and his
robots.txt parsing algorithm.
I want to thank everybody who gave me feedback, bug reports and
suggestions.
Versioning
----------
Version numbers have the same meaning as Linux kernel version numbers.
The first number is the major package version. The second number is
the minor package version. An odd second number denotes a development
version, an even one a stable version. The third number is a
package release sequence number.
So for example 1.1.5 is the fifth release of the 1.1 development package.
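
The numbering rule can be expressed in a few lines of Python. This is only
an illustrative sketch: the helper names parse_version() and
is_development() are made up for this example and are not part of
LinkChecker.

```python
def parse_version(version):
    """Split a 'major.minor.release' string into three integers."""
    major, minor, release = (int(part) for part in version.split("."))
    return major, minor, release


def is_development(version):
    """Return True if the minor number is odd (development series)."""
    _, minor, _ = parse_version(version)
    return minor % 2 == 1


print(is_development("1.1.5"))  # odd minor number: development version
print(is_development("1.2.0"))  # even minor number: stable version
```

With the rule above, "1.1.5" is reported as a development version and
"1.2.0" as a stable one.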
Included packages
-----------------
httplib                from http://www.lyra.org/greg/python/
httpslib               from http://home.att.net/~nvsoft1/ssl_wrapper.html
DNS                    see DNS/README
fcgi.py and sz_fcgi.py from http://saarland.sz-sb.de/~ajung/sz_fcgi/
fintl.py               from http://sourceforge.net/snippet/detail.php?type=snippet&id=100059

Note that I modified the following packages:
httplib.py (renamed to http11lib.py and fixed a bug)
fcgi.py    (implemented immediate output)
sz_fcgi.py (simplified the code)
Internationalization
--------------------
For German output, execute "export LC_MESSAGES=de" in bash or
"setenv LC_MESSAGES de" in tcsh.
Under Windows, execute "set LC_MESSAGES=de".
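
To illustrate what the variable means to a program, here is a minimal
sketch of how a language code could be derived from LC_MESSAGES. The
function language_from_environment() is hypothetical. LinkChecker itself
relies on the bundled fintl.py, and this sketch ignores other locale
variables such as LANG or LC_ALL.

```python
import os


def language_from_environment(environ=None, default="en"):
    """Return the language code found in LC_MESSAGES, or the default."""
    if environ is None:
        environ = os.environ
    value = environ.get("LC_MESSAGES", "")
    # Drop an optional encoding ("de_DE.UTF-8") and territory ("de_DE").
    language = value.split(".")[0].split("_")[0]
    # The "C" and "POSIX" locales mean "no translation".
    if language in ("", "C", "POSIX"):
        return default
    return language


print(language_from_environment({"LC_MESSAGES": "de"}))
```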
Code design
-----------
Read this only if you want to hack on the code.

(1) Look at the linkchecker script. It just reads all the
    command-line options and stores them in a Config object.

(2) Which leads us directly to the Config class. This class stores all
    options and works a little magic: it tries to find out if your platform
    supports threads. If so, they are enabled. If not, they are disabled.
    Note: several functions are replaced with their non-threaded
    equivalents if threading is disabled.
(3) The linkchecker script finally calls linkcheck.checkUrls(), which
    calls linkcheck.Config.checkUrl(), which calls
    linkcheck.UrlData.check().
    A UrlData object represents a single URL with all attached data like
    validity, check time and so on. These values are filled in by the
    UrlData.check() function.
    Derived from the base class UrlData are the different URL types:
    HttpUrlData for http:// links, MailtoUrlData for mailto: links and so on.

    So UrlData defines the functions common to *all* URLs, and
    the subclasses define the functions needed for their URL type.
(4) Let's look at the output. Every output format is defined in a Logger
    class. Each logger has the functions init(), newUrl() and endOfOutput().
    You call init() once to initialize the Logger, newUrl() for each URL
    we have checked and endOfOutput() when all URLs are checked. Easy.
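
The steps above can be sketched in a few lines of Python. The class and
function names (Config, UrlData, HttpUrlData, MailtoUrlData, checkUrls,
checkUrl, check, init, newUrl, endOfOutput) come from the description.
Everything else is an assumption made for illustration: the signatures,
the checkConnection() hook, the getUrlData() factory, the TextLogger class
and the placeholder validity rules. No network access is done, and only
the non-threaded path is shown.

```python
import time


class UrlData:
    """Base class: data and behaviour common to all URL types."""

    def __init__(self, url):
        self.url = url
        self.valid = None
        self.checktime = 0.0

    def check(self):
        # Fill in validity and check time for this URL.
        start = time.time()
        self.valid = self.checkConnection()
        self.checktime = time.time() - start

    def checkConnection(self):  # hypothetical hook for subclasses
        raise NotImplementedError


class HttpUrlData(UrlData):
    def checkConnection(self):
        # Placeholder rule: something must follow the scheme.
        return len(self.url) > len("http://")


class MailtoUrlData(UrlData):
    def checkConnection(self):
        # Placeholder rule: the address must contain an "@".
        return "@" in self.url[len("mailto:"):]


class TextLogger:
    """A logger with the three functions named in step (4)."""

    def __init__(self):
        self.lines = []

    def init(self):
        self.lines.append("Start checking")

    def newUrl(self, urlData):
        status = "valid" if urlData.valid else "error"
        self.lines.append("%s %s" % (status, urlData.url))

    def endOfOutput(self):
        self.lines.append("Done")


class Config:
    """Stores options and detects thread support."""

    def __init__(self, logger):
        self.logger = logger
        try:
            import threading  # noqa: F401
            self.threads = True
        except ImportError:
            self.threads = False

    def checkUrl(self, urlData):
        # Non-threaded variant; a threaded one would hand the
        # check over to a worker thread instead.
        urlData.check()
        self.logger.newUrl(urlData)


def getUrlData(url):  # hypothetical factory
    if url.startswith("mailto:"):
        return MailtoUrlData(url)
    return HttpUrlData(url)


def checkUrls(config, urls):
    config.logger.init()
    for url in urls:
        config.checkUrl(getUrlData(url))
    config.logger.endOfOutput()


logger = TextLogger()
checkUrls(Config(logger), ["http://example.com/", "mailto:nobody"])
print("\n".join(logger.lines))
```

Adding a new output format in this sketch means writing another class
with init(), newUrl() and endOfOutput(), and adding a new URL type means
subclassing UrlData.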