git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1415 e7d03fd6-7b0d-0410-9947-9c21f3af8025
calvin 2004-08-16 19:09:37 +00:00
parent e49c6f2fcb
commit 0c1a7054d8
3 changed files with 373 additions and 0 deletions

doc/faq.txt (new file)

@@ -0,0 +1,168 @@
Q1: LinkChecker produced an error, but my web page is ok with
Netscape/IE/Opera/...
Is this a bug in LinkChecker?
A1: Please check your web pages first. Are they really ok? Use
a syntax-highlighting editor! Use HTML Tidy from http://tidy.sourceforge.net/!
Check whether you are using a proxy that produces the error.
Q2: I still get an error, but the page is definitely ok.
A2: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy of the webmaster running the website you are checking.
A website might even send robots different web pages than it
sends to normal browsers.
Q3: How can I tell LinkChecker which proxy to use?
A3: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy
environment variables to a URL that identifies the proxy server before
starting LinkChecker. For example:
# http_proxy="http://www.someproxy.com:3128"
# export http_proxy
In a Macintosh environment, LinkChecker will retrieve proxy information
from Internet Config.
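As a sanity check, Python's urllib reads these same environment variables; a small sketch (the proxy URL is a placeholder, and LinkChecker of this vintage ran on Python 2, so this is purely illustrative):

```python
import os
import urllib.request

# Set the proxy the same way the FAQ describes (normally done in the
# shell before starting LinkChecker); the URL here is a placeholder.
os.environ["http_proxy"] = "http://www.someproxy.com:3128"

# urllib consults the *_proxy environment variables, which is why
# Python-based tools pick up proxies transparently.
proxies = urllib.request.getproxies()
print(proxies["http"])  # -> http://www.someproxy.com:3128
```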
Q4: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.
A4: You have to quote special characters (e.g. spaces) in the subject field.
The correct link should be "mailto:...?subject=Hello%20John"
Unfortunately browsers like IE and Netscape do not enforce this.
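The quoting can be done programmatically; a small illustration with Python's standard urllib.parse (not part of LinkChecker itself):

```python
from urllib.parse import quote

subject = "Hello John"
# Percent-encode special characters (here the space) for use in a URL.
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)  # mailto:john@company.com?subject=Hello%20John
```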
Q5: Has LinkChecker JavaScript support?
A5: No, and it never will. If your page does not work without JavaScript,
then your web design is broken.
Use PHP, Zope or ASP for dynamic content, and use JavaScript only as
an add-on for your web pages.
Q6: I have a pretty large site to check. How can I restrict link checking
to check only my own pages?
A6: Look at the options --intern, --extern, --strict, --denyallow and
--recursion-level.
Q7: I don't get this --extern/--intern stuff.
A7: When it comes to checking, there are three types of URLs:
1) strict external URLs:
We only do syntax checking. Internal URLs are never strict.
2) external URLs:
Like 1), but we additionally check if they are valid by connect()ing
to them.
3) internal URLs:
Like 2), but we additionally check if they are HTML pages and, if so,
we descend recursively into the link and check all the links in the
HTML content.
The --recursion-level option restricts the number of such recursive
descents.
LinkChecker provides four options which determine into which of these
three categories a URL falls: --intern, --extern, --strict and
--denyallow.
By default all URLs are internal. With --extern you specify which URLs
are external; with --intern you specify which URLs are internal.
Now imagine you give both --extern and --intern. What happens
when a URL matches both patterns? Or when it matches none? In this
situation the --denyallow option specifies the order in which we match
the URL. By default the order is internal/external; with --denyallow it is
external/internal. Either way, the first match counts, and if nothing
matches, the last checked category becomes the URL's category.
Finally, with --strict all external URLs are strict.
Oh, and just to boggle your mind: you can have more than one external
regular expression in a config file and for each of those expressions
you can specify if those matched external URLs should be strict or not.
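The matching order above can be sketched in a few lines of Python; this is an illustrative model, not LinkChecker's actual code:

```python
import re

# A minimal sketch of the matching order described above: the first
# matching pattern wins, and if nothing matches, the last category
# checked becomes the URL's category.
def classify(url, intern_patterns, extern_patterns, denyallow=False):
    # Default order is internal/external; --denyallow flips it.
    if denyallow:
        order = [("extern", extern_patterns), ("intern", intern_patterns)]
    else:
        order = [("intern", intern_patterns), ("extern", extern_patterns)]
    category = None
    for name, patterns in order:
        category = name  # remember the last checked category as fallback
        if any(re.search(p, url) for p in patterns):
            return name
    return category

print(classify("http://mydomain.com/x",
               [r"^http://my(other)?domain\.com"], []))  # intern
```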
An example. Assume we want to check only URLs of our domains
'mydomain.com' and 'myotherdomain.com'. Then we specify
-i'^http://my(other)?domain\.com' as the internal regular expression; all
other URLs are treated as external. Easy.
Another example. We don't want to check mailto URLs. Then it is
-i'!^mailto:'. The '!' negates an expression. With --strict, we don't
even connect to any mail hosts.
Yet another example. We check our site www.mycompany.com, don't want to
recurse into external links pointing outside our site, and want to ignore
links to hollowood.com and hullabulla.com completely.
This can only be done with a configuration entry like
[filtering]
extern1=hollowood.com 1
extern2=hullabulla.com 1
# the 1 means strict external, i.e. don't even connect
and the command
linkchecker --intern=www.mycompany.com www.mycompany.com
Q8: Is LinkChecker's cookie feature insecure?
A8: Cookies cannot store more information than is in the HTTP request itself,
so you are not giving away any additional system information.
After being stored, however, the cookies are sent back to the server on
subsequent requests. Not to every server, but only to the one the cookie
originated from! This can be used to "track" subsequent requests to that
server, and this is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost, so the tracking is restricted to the checking time.
The cookie feature is disabled by default.
Q9: I want to have my own logging class. How can I use it in LinkChecker?
A9: Currently, only the Python API lets you define new logging classes.
Define your own logging class as a subclass of StandardLogger or any other
logging class in the log module.
Then call the addLogger function of Config.Configuration to register
your new logger.
After that, append a new logger instance to the file output list:
import linkcheck, MyLogger
log_format = 'mylog'
log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
cfg = linkcheck.Config.Configuration()
cfg.addLogger(log_format, MyLogger.MyLogger)
cfg['fileoutput'].append(cfg.newLogger(log_format, log_args))
Q10.1: LinkChecker does not ignore anchor references when caching.
Q10.2: Some links with anchors are getting checked twice.
A10: This is not a bug.
It is commonly believed that if a URL ABC#anchor1 works, then
ABC#anchor2 works too. That is not specified anywhere, and I have seen
server-side scripts that fail on some anchors and not on others.
This is the reason for always checking URLs with different anchors.
If you really want to disable this, use --no-anchor-caching.
Q11: I see LinkChecker gets a "/robots.txt" file for every site it
checks. What is that about?
A11: LinkChecker follows the robots.txt exclusion standard. To avoid
misuse of LinkChecker, you cannot turn this feature off.
See http://www.robotstxt.org/wc/robots.html and
http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
for more info.
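The exclusion standard itself is easy to experiment with; a sketch using Python's standard robots.txt parser (not LinkChecker's own implementation):

```python
from urllib.robotparser import RobotFileParser

# Parse a small robots.txt directly instead of fetching one over HTTP.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
# Disallowed and allowed paths, checked for a robot named "LinkChecker":
print(rp.can_fetch("LinkChecker", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("LinkChecker", "http://example.com/public/page.html"))   # True
```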
Q12: Ctrl-C does not stop LinkChecker immediately. Why is that so?
A12: The Python interpreter has to wait for all threads to finish, and
this means waiting for all open sockets to close. The default timeout
for sockets is 30 seconds, hence the delay.
You can change the default socket timeout with the --timeout option.
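For illustration, Python exposes the same idea as an interpreter-wide default timeout for new sockets; a sketch, not necessarily the exact mechanism LinkChecker uses internally:

```python
import socket

# Set a global default timeout (in seconds) for newly created sockets;
# the 30-second figure matches the FAQ's default.
socket.setdefaulttimeout(30)
print(socket.getdefaulttimeout())
```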
This is a list of things LinkChecker will *not* do for you.
1) Support JavaScript
See the FAQ, question Q5.
2) Print unreachable/dead documents of your website.
This would require
- file system access to your web repository
- access to your web server configuration
You can instead store the linkchecker results in a database
and look for missing files.
3) HTML/XML syntax checking
Use the HTML Tidy program from http://tidy.sourceforge.net/.

doc/index.txt (new file)

@@ -0,0 +1,113 @@
LinkChecker
===========
LinkChecker checks HTML documents for broken links.
It features
o recursive checking
o multithreading
o output in colored or normal text, HTML, SQL, CSV or a sitemap
graph in GML or XML.
o HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Gopher, Telnet and local
file links support
o restriction of link checking with regular expression filters for URLs
o proxy support
o username/password authorization for HTTP and FTP
o robots.txt exclusion protocol support
o i18n support
o a command line interface
o a (Fast)CGI web interface (requires HTTP server)
Installing and Requirements
---------------------------
Read the file INSTALL.
Running the program
-------------------
o Unix or Mac OS X platforms
The local configuration file is $HOME/.linkcheckerrc
Type "linkchecker" followed by the URLs you want to check.
Type "linkchecker -h" for help.
o Windows platforms
Double-click on "linkchecker.bat" on your desktop.
URL input is interactive.
Another way is to execute "python.exe linkchecker" in the Python
Scripts directory.
o Mac OS 9.x platforms
Read the MacOS Python documentation to find out about passing
commandline options to Python scripts.
License and Credits
-------------------
LinkChecker is licensed under the GNU General Public License (GPL).
Credits go to Guido van Rossum and his team for making Python.
His hovercraft is full of eels!
As this program is directly derived from my Java link checker, additional
credits go to Robert Forsman (the author of JCheckLinks) and his
robots.txt parsing algorithm.
Nicolas Chauvat <Nicolas.Chauvat@logilab.fr> supplied a patch for
an XML output logger.
I want to thank everybody who gave me feedback, bug reports and
suggestions.
Versioning
----------
Version numbers have the same meaning as Linux kernel version numbers.
The first number is the major package version. The second number is
the minor package version: an odd second number stands for development
versions, an even one for stable versions. The third number is a
package release sequence number.
So for example 1.1.5 is the fifth release of the 1.1 development series.
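The odd/even rule can be restated as a few lines of Python; the helper name and version strings are made up for illustration:

```python
# Odd minor number = development release, even = stable release,
# following the versioning scheme described above.
def release_kind(version):
    minor = int(version.split(".")[1])
    return "development" if minor % 2 else "stable"

print(release_kind("1.1.5"))  # development
print(release_kind("1.2.0"))  # stable
```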
Included packages
-----------------
fcgi.py and sz_fcgi.py from Andreas Jung (http://www.andreas-jung.com/privat.html)
Note that included packages are modified by me.
Internationalization
--------------------
For German output, execute "export LC_MESSAGES=de" in bash or
"setenv LC_MESSAGES de" in tcsh.
Under Windows, execute "set LC_MESSAGES=de".
Other supported languages are 'nl' (Dutch) and 'fr' (French).
If you want to help me translate LinkChecker, copy the linkchecker.pot
file to <your language>.po and send me the translated file.
Code design
-----------
Read this only if you want to hack on the code.
(1) Look at the linkchecker script. This thing just reads all the
commandline options and stores them in a Config object.
(2) Which leads us directly to the Config class. This class stores all
options and supports threading and reading config files.
A Config object reads config file options on initialization so they get
handled before any commandline options.
(3) The linkchecker script calls linkcheck.checkUrls(), which
calls linkcheck.Config.checkUrl(), which calls linkcheck.UrlData.check().
A UrlData object represents a single URL with all attached data like
validity, check time and so on. These values are filled in by the
UrlData.check() function.
Derived from the base class UrlData are the different URL types:
HttpUrlData for http:// links, MailtoUrlData for mailto: links, etc.
UrlData defines the functions which are common for *all* URLs, and
the subclasses define functions needed for their URL type.
(4) Let's look at the output. Each output format is implemented by a Logger class.
Each logger has functions init(), newUrl() and endOfOutput().
We call init() once to initialize the Logger. UrlData.check() calls
newUrl() (through UrlData.logMe()) for each new URL and after all
checking is done we call endOfOutput(). Easy.
New loggers are created with the Config.newLogger function.
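The logger interface above can be sketched as follows; the class body is a hypothetical stand-in, only the method names come from the text:

```python
# A skeletal logger following the interface described above.
class ListLogger:
    def init(self):
        # called once to initialize the logger
        self.urls = []

    def newUrl(self, url_data):
        # called for each new URL (via UrlData.logMe() in LinkChecker)
        self.urls.append(url_data)

    def endOfOutput(self):
        # called after all checking is done
        return len(self.urls)

logger = ListLogger()
logger.init()
logger.newUrl("http://example.com/")
print(logger.endOfOutput())  # 1
```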

doc/install.txt (new file)

@@ -0,0 +1,92 @@
Installation
============
Requirements for Unix/Linux or Mac OS X
---------------------------------------
1. You need a standard GNU development environment with
a) C compiler (for example the GNU C Compiler gcc)
b) gettext
2. Python >= 2.3 from http://www.python.org/ with zlib support
Requirements for Windows
------------------------
Direct download links are in brackets.
1. Install the MinGW suite from http://mingw.sourceforge.net.
Be sure to install in the given order:
a) MinGW
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/MinGW-3.1.0-1.exe ]
b) MSYS
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/MSYS-1.0.10.exe ]
c) libiconv
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/libiconv-1.8.0-2003.02.01-1.exe ]
d) gettext
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/gettext-0.11.5-2003.02.01-1.exe ]
2. Install Python >= 2.3 from http://www.python.org/
[ http://www.python.org/ftp/python/2.3.3/Python-2.3.3.exe ]
Setup for Unix/Linux or Mac OS X
--------------------------------
0. Help note
Run "python setup.py --help" for help about command options
1. Compile Python modules
Run "python setup.py build" to compile the Python files.
The CC environment variable is checked before compilation, so you can
change the default C compiler with "export CC=myccompiler".
2. Install Python modules
Run "python setup.py install" to install everything.
Setup for Windows
-----------------
0. Install check
Be sure to have installed all required software listed above.
1. Preparing Python for the MinGW compiler
Search for the file python23.dll in your Windows folder.
Once you have found it, launch MSYS. Then change into your Windows folder,
for example with:
% cd c:\winnt\system32
After that execute:
% pexports python23.dll > python23.def
Then use dlltool:
% dlltool --dllname python23.dll --def python23.def --output-lib libpython23.a
The resulting library has to be placed in the same directory as
python23.lib. (Should be the libs directory under your Python installation
directory.)
2. Compile gettext translations
Change to the linkchecker-X.X.X/po directory and run:
% make win
3. Compile and install the Python modules
Start a DOS command shell and change to the linkchecker-X.X.X directory:
c:> python setup.py build -c mingw32 install
Installation for other platforms
--------------------------------
If you happen to install LinkChecker on other platforms (for example
Mac OS 9.x) then drop me a note.
(Fast)CGI web interface
-----------------------
The *cgi files are three CGI scripts which you can use to run LinkChecker
with a nice graphical web interface.
You can use and adjust the example HTML files in the lconline directory
to run the script.
1. Choose a CGI script. The simplest is lc.cgi; for it you need a web
server with CGI support.
The scripts lc.fcgi (which I tested a while ago) and lc.sz_fcgi
(untested) need a web server with FastCGI support.
2. Copy the script of your choice into the CGI directory.
3. Adjust the "action=..." parameter in lconline/lc_cgi.html
to point to your CGI script.
4. Load the lconline/index.html file, enter a URL and click on the
check button.
5. If something goes wrong, check the following:
a) Look in the error log of your web server.
b) Be sure that you have enabled CGI support in your web server;
verify this by running other CGI scripts that you know are working.
c) Try to run the lc.cgi script by hand.
d) Try the testit() function in the lc.cgi script.