LinkChecker
===========

LinkChecker checks HTML documents for broken links.

It features:
o recursive checking
o multithreading
o output in colored or normal text, HTML, SQL, CSV or a sitemap
  graph in GML or XML
o support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Gopher,
  Telnet and local file links
o restriction of link checking with regular expression filters for URLs
o proxy support
o username/password authorization for HTTP and FTP
o robots.txt exclusion protocol support
o i18n support
o a command line interface
o a (Fast)CGI web interface (requires an HTTP server)


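To give an idea of what the URL filter and robots.txt features mean in
practice, here is a minimal sketch in modern Python. It is not
LinkChecker's own code: the function should_check(), the INCLUDE pattern,
the sample robots.txt text and the example.com URLs are all assumptions
made for illustration. It uses the standard library modules re and
urllib.robotparser.

```python
import re
from urllib import robotparser

# Hypothetical filter: only URLs matching this pattern are checked.
INCLUDE = re.compile(r"^https?://example\.com/")

# Sample robots.txt rules, parsed from text so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def should_check(url, include=INCLUDE, robots_txt=ROBOTS_TXT,
                 agent="LinkChecker"):
    """Return True if url passes the regex filter and robots.txt allows it."""
    if not include.search(url):
        return False
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for url in ("http://example.com/index.html",
            "http://example.com/private/a.html",
            "http://other.example.org/"):
    print(url, should_check(url))
```

Only the first URL passes: the second is excluded by the robots.txt rule
and the third by the regular expression filter.
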
Installing, Requirements, Running
---------------------------------
Read the file INSTALL.


License and Credits
-------------------
LinkChecker is licensed under the GNU General Public License (GPL).
Credits go to Guido van Rossum for making Python. His hovercraft is
full of eels!
As this program is directly derived from my Java link checker, additional
credits go to Robert Forsman (the author of JCheckLinks) and his
robots.txt parsing algorithm.
Nicolas Chauvat <Nicolas.Chauvat@logilab.fr> supplied a patch for
an XML output logger.
I want to thank everybody who gave me feedback, bug reports and
suggestions.


Versioning
----------
Version numbers have the same meaning as Linux kernel version numbers.
The first number is the major package version. The second number is
the minor package version. An odd second number stands for a development
version, an even number for a stable version. The third number is a
package release sequence number.
So for example 1.1.5 is the fifth release of the 1.1 development package.


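The numbering scheme can be expressed in a few lines of Python. This is
only an illustrative sketch; the function name describe_version() is made
up here and is not part of LinkChecker.

```python
def describe_version(version):
    """Split 'major.minor.release' and classify the minor number."""
    major, minor, release = (int(part) for part in version.split("."))
    # Odd minor numbers mark development versions, even ones stable versions.
    kind = "development" if minor % 2 else "stable"
    return major, minor, release, kind


print(describe_version("1.1.5"))  # (1, 1, 5, 'development')
print(describe_version("1.2.0"))  # (1, 2, 0, 'stable')
```
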
Included packages
-----------------
httplib from http://www.lyra.org/greg/python/
httpslib from http://home.att.net/~nvsoft1/ssl_wrapper.html
DNS see DNS/README
fcgi.py and sz_fcgi.py from http://saarland.sz-sb.de/~ajung/sz_fcgi/
fintl.py from http://sourceforge.net/snippet/detail.php?type=snippet&id=100059

Note that I have modified the following packages:
httplib.py (renamed to http11lib.py and a bug fixed)
fcgi.py (implemented streamed output)
sz_fcgi.py (simplified the code)
DNS/Lib.py:566 fixed rdlength name error
DNS/Lib.py:105 tuple parameter for Python 1.6 compatibility
DNS/Base.py: fixed /etc/resolv.conf parser to cope with empty lines


Internationalization
--------------------
For German output execute "export LC_MESSAGES=de" in bash or
"setenv LC_MESSAGES de" in tcsh.
Under Windows, execute "set LC_MESSAGES=de".
For French output use 'fr' instead of 'de'.


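How a program might react to LC_MESSAGES can be sketched with Python's
standard gettext module. This is an assumption-laden illustration, not
the bundled fintl.py: the helper names pick_language() and
load_translator(), the catalog domain "linkcheck" and the "locale"
directory are all hypothetical.

```python
import gettext
import os


def pick_language(environ=os.environ, default="en"):
    """Return the bare language code from LC_MESSAGES, e.g. 'de'."""
    value = environ.get("LC_MESSAGES", "")
    # Reduce values such as 'de_DE.UTF-8' to the language code 'de'.
    code = value.split(".")[0].split("_")[0]
    return code or default


def load_translator(lang, domain="linkcheck", localedir="locale"):
    """Load a message catalog; fall back to untranslated text if none exists."""
    return gettext.translation(domain, localedir, languages=[lang],
                               fallback=True)


lang = pick_language({"LC_MESSAGES": "de"})
_ = load_translator(lang).gettext
# Without a matching catalog on disk the message is printed untranslated.
print(lang, _("Valid"))
```
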
Code design
-----------
Read this only if you want to hack on the code.

(1) Look at the linkchecker script. It just reads all the
command line options and stores them in a Config object.

(2) Which leads us directly to the Config class. This class stores all
options and works a little magic: it tries to find out if your platform
supports threads. If so, threading is enabled. If not, it is disabled.
Several functions are replaced with their threaded equivalents if
threading is enabled.
Config files are handled here as well. A Config object reads config file
options on initialization so they get handled before any command line
options.

(3) The linkchecker script finally calls linkcheck.checkUrls(), which
calls linkcheck.Config.checkUrl(), which calls linkcheck.UrlData.check().
A UrlData object represents a single URL with all attached data like
validity, check time and so on. These values are filled in by the
UrlData.check() function.
Derived from the base class UrlData are the different URL types:
HttpUrlData for http:// links, MailtoUrlData for mailto: links, etc.

UrlData defines the functions which are common to *all* URLs, and
the subclasses define functions needed for their URL type.

(4) Let's look at the output. Every output format is defined in a Logger
class. Each logger has the functions init(), newUrl() and endOfOutput().
We call init() once to initialize the Logger. UrlData.check() calls
newUrl() (through UrlData.logMe()) for each new URL and after all
checking is done we call endOfOutput(). Easy.
New loggers are created with the Config.newLogger function.
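
The call chain described in steps (1) to (4) can be sketched as follows.
This is a simplified illustration in modern Python, not the actual
LinkChecker source. The names Config, UrlData, HttpUrlData,
MailtoUrlData, checkUrls(), checkUrl(), check(), logMe(), init(),
newUrl() and endOfOutput() come from the text above; TextLogger,
check_connection(), the lines attribute and the placeholder checks are
assumptions made for the example. The thread detection only sets a flag
here and no threads are started.

```python
import time


class TextLogger:
    """Hypothetical logger with the three functions named in the text."""

    def init(self):
        self.lines = ["Start checking"]

    def newUrl(self, url_data):
        state = "valid" if url_data.valid else "error"
        self.lines.append(f"{url_data.url}: {state}")

    def endOfOutput(self):
        self.lines.append("Done")


class UrlData:
    """Base class: one URL plus the data gathered while checking it."""

    def __init__(self, url):
        self.url = url
        self.valid = False
        self.check_time = 0.0

    def check(self, config):
        start = time.time()
        self.valid = self.check_connection()
        self.check_time = time.time() - start
        self.logMe(config)

    def check_connection(self):
        # Hypothetical hook: subclasses supply the scheme-specific check.
        raise NotImplementedError

    def logMe(self, config):
        config.logger.newUrl(self)


class HttpUrlData(UrlData):
    def check_connection(self):
        # Placeholder: a real check would issue an HTTP request.
        return self.url.startswith("http://")


class MailtoUrlData(UrlData):
    def check_connection(self):
        # Placeholder: a real check would validate the mail address.
        return "@" in self.url


class Config:
    """Holds options and the logger; detects thread support on creation."""

    def __init__(self, logger):
        self.logger = logger
        try:
            import threading  # noqa: F401
            self.threads = True
        except ImportError:
            self.threads = False

    def checkUrl(self, url_data):
        url_data.check(self)


def checkUrls(config, urls):
    config.logger.init()
    for url in urls:
        cls = MailtoUrlData if url.startswith("mailto:") else HttpUrlData
        config.checkUrl(cls(url))
    config.logger.endOfOutput()


logger = TextLogger()
checkUrls(Config(logger), ["http://example.com/", "mailto:nobody"])
print("\n".join(logger.lines))
```

Keeping the logger behind three small functions means a new output format
only has to implement init(), newUrl() and endOfOutput(); the checking
code does not change.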