
                      LinkChecker
                     =============

LinkChecker checks HTML documents for broken links.

It features
o recursive checking
o multithreading
o output in colored or normal text, HTML, SQL, CSV or a sitemap
  graph in GML or XML
o support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Gopher, Telnet
  and local file links
o restriction of link checking with regular expression filters for URLs
o proxy support
o username/password authorization for HTTP and FTP
o robots.txt exclusion protocol support
o i18n support
o a command line interface
o a (Fast)CGI web interface (requires HTTP server)


Installing, Requirements, Running
---------------------------------
Read the file INSTALL.


License and Credits
-------------------
LinkChecker is licensed under the GNU General Public License (GPL).
Credits go to Guido van Rossum for making Python. His hovercraft is
full of eels!
As this program is directly derived from my Java link checker, additional
credits go to Robert Forsman, the author of JCheckLinks, for his
robots.txt parsing algorithm.
Nicolas Chauvat <Nicolas.Chauvat@logilab.fr> supplied a patch for
an XML output logger.
I want to thank everybody who gave me feedback, bug reports and
suggestions.


Versioning
----------
Version numbers have the same meaning as Linux kernel version numbers.
The first number is the major package version. The second number is
the minor package version: an odd second number stands for a development
version, an even number for a stable version. The third number is the
package release sequence number.
For example, 1.1.5 is the fifth release of the 1.1 development package.
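
As a rough illustration of the scheme above, the following sketch splits
a version string and tells development releases from stable ones. The
helper names parse_version and is_development are made up for this
example and are not part of LinkChecker; 1.2.0 is just an example value.

```python
def parse_version(version):
    """Split 'major.minor.release' into a tuple of three integers."""
    major, minor, release = (int(part) for part in version.split("."))
    return major, minor, release


def is_development(version):
    """An odd minor number marks a development version."""
    _, minor, _ = parse_version(version)
    return minor % 2 == 1


for example in ("1.1.5", "1.2.0"):
    kind = "development" if is_development(example) else "stable"
    print(example, "is a", kind, "version")
```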


Included packages
-----------------
DNS from http://pydns.sourceforge.net/
fcgi.py and sz_fcgi.py from http://saarland.sz-sb.de/~ajung/sz_fcgi/
fintl.py from http://sourceforge.net/snippet/detail.php?type=snippet&id=100059

Note that I have modified all included packages.


Internationalization
--------------------
For German output execute "export LC_MESSAGES=de" in bash or
"setenv LC_MESSAGES de" in tcsh.
Under Windows, execute "set LC_MESSAGES=de".
For French output use "fr" instead of "de".
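
The following is a minimal sketch of how a Python program can pick up
LC_MESSAGES, using the standard gettext module. It is an assumption made
for illustration only: LinkChecker ships its own fintl.py, and the
domain name "linkcheck", the directory "locale" and the helper
load_translator are hypothetical.

```python
import gettext
import os


def load_translator(domain="linkcheck", localedir="locale"):
    """Return a translation object for the language in the environment.

    With no explicit language list, gettext consults the environment
    variables LANGUAGE, LC_ALL, LC_MESSAGES and LANG, in that order.
    fallback=True returns a pass-through translator when no message
    catalog is found, instead of raising an error.
    """
    return gettext.translation(domain, localedir=localedir, fallback=True)


# Stands in for 'export LC_MESSAGES=de' in the shell (this process only).
os.environ["LC_MESSAGES"] = "de"
_ = load_translator().gettext
print(_("Valid"))
```

If no matching catalog is installed, the message is printed untranslated.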


Code design
-----------
Read this section only if you want to hack on the code.

(1) Look at the linkchecker script. It just reads all the command line
options and stores them in a Config object.

(2) This leads us directly to the Config class. This class stores all
options and works a little magic: it tries to find out if your platform
supports threads. If so, threading is enabled. If not, it is disabled.
Several functions are replaced with their threaded equivalents if
threading is enabled.
Config files are handled here as well: a Config object reads config file
options on initialization, so they get handled before any command line
options.
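
The idea of step (2) can be sketched as follows. This is not the real
Config class, only a simplified stand-in: the names file_options,
check_url, finish and checked are invented for the example. On current
Python versions the threading import always succeeds, so the fallback
branch only illustrates the probing.

```python
try:
    import threading
except ImportError:  # platform without thread support
    threading = None


class Config:
    """Simplified stand-in: stores options and selects threaded functions."""

    def __init__(self, file_options=None):
        self.threads = threading is not None
        # Config file options are read first, on initialization.
        self.options = dict(file_options or {})
        self.checked = []
        self._workers = []
        # Replace the plain function with its threaded equivalent.
        if self.threads:
            self.check_url = self._check_url_threaded
        else:
            self.check_url = self._check_url_plain

    def _check_url_plain(self, url):
        self.checked.append(url)  # placeholder for the real check

    def _check_url_threaded(self, url):
        worker = threading.Thread(target=self._check_url_plain, args=(url,))
        self._workers.append(worker)
        worker.start()

    def finish(self):
        """Wait for all worker threads to complete."""
        for worker in self._workers:
            worker.join()


config = Config(file_options={"verbose": False})
config.options.update({"verbose": True})  # command line options come later
config.check_url("http://www.example.org/")
config.finish()
print(config.options, config.checked)
```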

(3) The linkchecker script finally calls linkcheck.checkUrls(), which
calls linkcheck.Config.checkUrl(), which calls linkcheck.UrlData.check().
A UrlData object represents a single URL with all attached data such as
validity, check time and so on. These values are filled in by the
UrlData.check() function.
Derived from the base class UrlData are the different URL types:
HttpUrlData for http:// links, MailtoUrlData for mailto: links, etc.

UrlData defines the functions which are common to *all* URLs, and
the subclasses define the functions needed for their URL type.
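
A stripped-down sketch of this class layout might look like the code
below. Only the names UrlData, HttpUrlData, MailtoUrlData and check()
come from the real code; check_syntax, URL_CLASSES and get_url_data are
invented here, and the "checks" merely inspect the URL text instead of
contacting any server.

```python
import time
from urllib.parse import urlsplit


class UrlData:
    """Base class: data and behavior common to all URLs."""

    def __init__(self, url):
        self.url = url
        self.valid = False
        self.check_time = 0.0

    def check(self):
        """Run the type-specific check and fill in the result values."""
        start = time.time()
        self.valid = self.check_syntax()
        self.check_time = time.time() - start

    def check_syntax(self):
        raise NotImplementedError("defined by the subclasses")


class HttpUrlData(UrlData):
    def check_syntax(self):
        # Toy rule: an http:// URL needs a host part.
        return bool(urlsplit(self.url).netloc)


class MailtoUrlData(UrlData):
    def check_syntax(self):
        # Toy rule: a mail address needs an "@".
        return "@" in urlsplit(self.url).path


URL_CLASSES = {"http": HttpUrlData, "mailto": MailtoUrlData}


def get_url_data(url):
    """Pick the UrlData subclass that matches the URL scheme."""
    return URL_CLASSES[urlsplit(url).scheme](url)


for link in ("http://www.example.org/", "mailto:nobody"):
    data = get_url_data(link)
    data.check()
    print(type(data).__name__, data.url, data.valid)
```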

(4) Let's look at the output. Every output format is defined in a Logger
class. Each logger has the functions init(), newUrl() and endOfOutput().
We call init() once to initialize the Logger. UrlData.check() calls
newUrl() (through UrlData.logMe()) for each new URL, and after all
checking is done we call endOfOutput(). Easy.
New loggers are created with the Config.newLogger function.
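
To make the logger life cycle concrete, here is a small sketch of a
plain text logger. The method names init(), newUrl() and endOfOutput()
are the ones described above; the class TextLogger, the arguments of
newUrl() and the output wording are assumptions made for this example
and will differ from the real loggers.

```python
import io
import sys


class Logger:
    """Interface that every output format implements."""

    def init(self):
        pass

    def newUrl(self, url, valid):
        pass

    def endOfOutput(self):
        pass


class TextLogger(Logger):
    """Writes one line per checked URL to a file-like object."""

    def __init__(self, fd=None):
        self.fd = fd if fd is not None else sys.stdout
        self.count = 0

    def init(self):
        self.fd.write("Start checking\n")

    def newUrl(self, url, valid):
        self.count += 1
        self.fd.write("%s %s\n" % (url, "valid" if valid else "error"))

    def endOfOutput(self):
        self.fd.write("%d URLs checked\n" % self.count)


# Life cycle: init() once, newUrl() per URL, endOfOutput() at the end.
buffer = io.StringIO()
logger = TextLogger(buffer)
logger.init()
logger.newUrl("http://www.example.org/", True)
logger.newUrl("mailto:nobody", False)
logger.endOfOutput()
print(buffer.getvalue(), end="")
```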