Hopiu/linkchecker: check links in web documents or full websites

mirror of https://github.com/Hopiu/linkchecker.git synced 2026-05-08 22:54:51 +00:00

calvin 9c33d3f0f1 setup.py stuff git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@74 e7d03fd6-7b0d-0410-9947-9c21f3af8025		2000-04-24 23:56:48 +00:00
debian	See changelog	2000-04-24 22:07:48 +00:00
DNS	move some files	2000-03-22 01:15:55 +00:00
GML	Added some grammar files	2000-02-28 13:49:21 +00:00
linkcheck	See changelog	2000-04-24 22:07:48 +00:00
PyLR	cleaned files	2000-02-28 20:42:54 +00:00
test	news test	2000-04-06 22:47:09 +00:00
tests	Initial revision	2000-02-26 10:24:46 +00:00
.cvsignore	move some files	2000-03-22 01:15:55 +00:00
create.sql	Initial revision	2000-02-26 10:24:46 +00:00
fcgi.py	See Changelog	2000-03-25 01:11:00 +00:00
http11lib.py	Netscape SErver fix	2000-04-07 20:16:07 +00:00
httpslib.py	HTTPS fixes	2000-02-29 12:59:27 +00:00
INSTALL	See changelog	2000-04-24 22:07:48 +00:00
lc.cgi	See changelog	2000-04-24 22:07:48 +00:00
lc.fcgi	See ChangeLog	2000-03-26 18:53:23 +00:00
lc.sz_fcgi	See ChangeLog	2000-03-26 18:53:23 +00:00
LICENSE	Initial revision	2000-02-26 10:24:46 +00:00
linkchecker	See changelog	2000-04-24 22:07:48 +00:00
linkchecker.bat	See changelog	2000-04-24 22:07:48 +00:00
linkcheckerrc	news: link support	2000-03-30 17:10:35 +00:00
Makefile	See changelog	2000-04-24 22:07:48 +00:00
MANIFEST.in	setup.py stuff	2000-04-24 23:56:48 +00:00
README	See changelog	2000-04-24 22:07:48 +00:00
setup.py	setup.py stuff	2000-04-24 23:56:48 +00:00
ssl.c	HTTPS support	2000-02-29 12:53:00 +00:00
StringUtil.py	See changelog	2000-04-24 22:07:48 +00:00
sz_fcgi.py	See ChangeLog	2000-03-26 18:53:23 +00:00
Template.py	setup.py stuff	2000-04-24 23:56:48 +00:00
TODO	See changelog	2000-04-24 22:07:48 +00:00

README

                      LinkChecker
                     =============

With LinkChecker you can check your HTML documents for broken links.
Features:
o recursive checking
o multithreaded
o output can be colored or normal text, HTML, SQL or a GML sitemap graph
o HTTP/1.1, HTTPS, FTP, mailto:, news:, Gopher, Telnet and local file links
  are supported (JavaScript links are currently ignored)
o restrict link checking to your local domain
o HTTP proxy support
o give username/password for HTTP and FTP authorization
o robots.txt exclusion protocol support 

LinkChecker is licensed under the GNU General Public License.
Credits go to Guido van Rossum for making Python. His hovercraft is
full of eels!
As this program is directly derived from my Java link checker, additional
credits go to Robert Forsman (the author of JCheckLinks) and his
robots.txt parsing algorithm.
I want to thank everybody who gave me feedback, bug reports and
suggestions.

Versioning:
Version numbers have the same meaning as Linux Kernel version numbers.
The first number is the major package version. The second number is
the minor package version. An odd second number stands for development
versions, an even number for stable versions. The third number is a
package release sequence number.
So for example 1.1.5 is the fifth release of the 1.1 development package.
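
A minimal sketch of how this scheme can be read programmatically. The
function name describe_version() is made up for this example and is not
part of LinkChecker.

```python
def describe_version(version):
    """Split 'major.minor.release' and classify the minor number.

    Hypothetical helper for illustration only: an odd minor number
    means a development version, an even one a stable version.
    """
    major, minor, release = (int(part) for part in version.split("."))
    kind = "development" if minor % 2 else "stable"
    return major, minor, release, kind


print(describe_version("1.1.5"))  # (1, 1, 5, 'development')
print(describe_version("1.2.0"))  # (1, 2, 0, 'stable')
```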

Included packages:
httplib from http://www.lyra.org/greg/python/
httpslib from http://home.att.net/~nvsoft1/ssl_wrapper.html
DNS see DNS/README
fcgi.py and sz_fcgi.py from http://saarland.sz-sb.de/~ajung/sz_fcgi/

Note that the following packages are modified by me:
httplib.py (renamed to http11lib.py)
fcgi.py
sz_fcgi.py


The big picture (if you want to hack on the code):

(1) Look at the linkchecker script. It just reads all the
command-line options and stores them in a Config object.

(2) Which leads us directly to the Config class. This class stores all
options and works a little magic: it tries to find out if your platform
supports threads. If so, they are enabled. If not, they are disabled.
Note: several functions are replaced with their non-threaded 
equivalents if threading is disabled.
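
The idea in (2) can be sketched roughly as follows. This is not the
real linkcheck.Config code: the attribute threads_enabled and the
methods run, _run_threaded and _run_inline are assumptions made up for
this example. It only shows the pattern of probing for thread support
once and swapping in a non-threaded equivalent when it is missing.

```python
# Probe for thread support once, at import time.
try:
    import threading
    HAVE_THREADS = True
except ImportError:
    threading = None
    HAVE_THREADS = False


class Config:
    """Hypothetical stand-in for the real Config class."""

    def __init__(self):
        self.threads_enabled = HAVE_THREADS
        # Replace the threaded function with its non-threaded
        # equivalent if threading is not available.
        if self.threads_enabled:
            self.run = self._run_threaded
        else:
            self.run = self._run_inline

    def _run_threaded(self, func, *args):
        # Start func in its own thread and hand the thread back.
        worker = threading.Thread(target=func, args=args)
        worker.start()
        return worker

    def _run_inline(self, func, *args):
        # No threads: just call func directly.
        func(*args)
        return None


config = Config()
checked = []
worker = config.run(checked.append, "http://localhost/")
if worker is not None:
    worker.join()
print(checked)
```

Callers only ever use config.run(), so they do not need to know which
variant is active.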

(3) The linkchecker script finally calls linkcheck.checkUrls(), which
calls linkcheck.Config.checkUrl(), which calls 
linkcheck.UrlData.check().
A UrlData object represents a single URL with all attached data like
validity, check time and so on. These values are filled in by the time
the UrlData.check() function returns.
Derived from the base class UrlData are the different URL types: 
HttpUrlData for http:// links, MailtoUrlData for mailto: links and so on.

So UrlData defines the functions which are common for *all* URLs, and
the subclasses define functions needed for their URL type.
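
A rough sketch of that class layout, assuming a much simpler interface
than the real one. Only the names UrlData, MailtoUrlData and check()
come from the description above; the attributes valid and check_time,
the method check_connection() and the toy address test are invented
for this example.

```python
import time


class UrlData:
    """Base class: behaviour common to all URL types."""

    def __init__(self, url):
        self.url = url
        self.valid = None       # filled in by check()
        self.check_time = None  # seconds, filled in by check()

    def check(self):
        start = time.time()
        try:
            self.check_connection()
            self.valid = True
        except Exception:
            self.valid = False
        self.check_time = time.time() - start

    def check_connection(self):
        # Hypothetical hook; each subclass supplies its own test.
        raise NotImplementedError


class MailtoUrlData(UrlData):
    """mailto: links. Toy check: the address must contain an '@'."""

    def check_connection(self):
        address = self.url[len("mailto:"):]
        if "@" not in address:
            raise ValueError("not a mail address: %r" % address)


for url in ("mailto:calvin@example.org", "mailto:nobody"):
    data = MailtoUrlData(url)
    data.check()
    print(data.url, data.valid)
```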

(4) Let's look at the output. Every output format is defined in its own
Logger class. Each logger has functions init(), newUrl() and endOfOutput().
You call init() once to initialize the Logger, newUrl() for each URL
that has been checked and endOfOutput() when all URLs are checked. Easy.
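
A minimal logger following this three-call protocol might look like
the sketch below. The class name TextLogger, the output wording and
the (url, valid) arguments of newUrl() are assumptions for this
example; the real loggers receive richer data.

```python
import sys


class TextLogger:
    """Hypothetical plain-text logger with the three hooks."""

    def __init__(self, stream=sys.stdout):
        self.stream = stream
        self.count = 0

    def init(self):
        # Called once before any URL is reported.
        self.stream.write("Start checking\n")

    def newUrl(self, url, valid):
        # Called once per checked URL.
        self.count += 1
        status = "ok" if valid else "error"
        self.stream.write("%-5s %s\n" % (status, url))

    def endOfOutput(self):
        # Called once after the last URL.
        self.stream.write("%d URLs checked\n" % self.count)


logger = TextLogger()
logger.init()
logger.newUrl("http://localhost/", True)
logger.newUrl("mailto:nobody", False)
logger.endOfOutput()
```

A new output format then only needs another class with the same three
functions.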