mirror of
https://github.com/Hopiu/linkchecker.git
synced 2026-04-20 22:31:00 +00:00
added
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1415 e7d03fd6-7b0d-0410-9947-9c21f3af8025
This commit is contained in:
parent
e49c6f2fcb
commit
0c1a7054d8
3 changed files with 373 additions and 0 deletions
168
doc/faq.txt
Normal file
|
|
@ -0,0 +1,168 @@
|
|||
Q1: LinkChecker produced an error, but my web page is ok with
|
||||
Netscape/IE/Opera/...
|
||||
Is this a bug in LinkChecker?
|
||||
A1: Please check your web pages first. Are they really ok? Use
|
||||
a syntax highlighting editor! Use HTML Tidy from www.w3c.org!
|
||||
Check if you are using a proxy which produces the error.
|
||||
|
||||
|
||||
Q2: I still get an error, but the page is definitely ok.
A2: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy set by the webmaster running the website you are checking.
A website might even send robots different web pages than it sends
to normal browsers.
|
||||
|
||||
|
||||
Q3: How can I tell LinkChecker which proxy to use?
|
||||
A3: LinkChecker works transparently with proxies. In a Unix or Windows
|
||||
environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy
|
||||
environment variables to a URL that identifies the proxy server before
|
||||
starting LinkChecker. For example:
  $ http_proxy="http://www.someproxy.com:3128"
  $ export http_proxy
|
||||
|
||||
In a Macintosh environment, LinkChecker will retrieve proxy information
|
||||
from Internet Config.
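As a rough illustration of how such variables are picked up, Python's standard library reads the same <scheme>_proxy variables. The sketch below uses the modern urllib.request module, which is an assumption on our part and not necessarily the code path LinkChecker itself uses; the proxy URL is the example value from above.

```python
import os
import urllib.request

# Example value from the answer above; it must be set before any request is made.
os.environ["http_proxy"] = "http://www.someproxy.com:3128"

# getproxies_environment() scans the environment for <scheme>_proxy variables
# and returns a mapping from scheme name to proxy URL.
proxies = urllib.request.getproxies_environment()
print(proxies.get("http"))
```

If the variable is unset, the mapping simply has no "http" entry and requests go out directly.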
|
||||
|
||||
|
||||
Q4: The link "mailto:john@company.com?subject=Hello John" is reported
|
||||
as an error.
|
||||
A4: You have to quote special characters (e.g. spaces) in the subject field.
|
||||
The correct link should be "mailto:...?subject=Hello%20John"
|
||||
Unfortunately browsers like IE and Netscape do not enforce this.
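If you generate such links from a script, the quoting can be left to a library. A minimal sketch using Python's urllib.parse.quote (the address and subject are the example values from the question):

```python
from urllib.parse import quote

subject = "Hello John"

# quote() percent-encodes characters that are not allowed unescaped in a URL,
# so the space becomes %20.
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)  # mailto:john@company.com?subject=Hello%20John
```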
|
||||
|
||||
|
||||
Q5: Does LinkChecker support JavaScript?
A5: No, and it never will. If your page does not work without JavaScript,
your web design is broken.
Use PHP, Zope or ASP for dynamic content, and use JavaScript only as
an add-on for your web pages.
|
||||
|
||||
|
||||
Q6: I have a pretty large site to check. How can I restrict link checking
|
||||
to check only my own pages?
|
||||
A6: Look at the options --intern, --extern, --strict, --denyallow and
|
||||
--recursion-level.
|
||||
|
||||
|
||||
Q7: I don't get this --extern/--intern stuff.
|
||||
A7: When it comes to checking there are three types of URLs:
|
||||
1) strict external URLs:
|
||||
We do only syntax checking. Internal URLs are never strict.
|
||||
2) external URLs:
|
||||
Like 1), but we additionally check if they are valid by connect()ing
|
||||
to them
|
||||
3) internal URLs:
|
||||
Like 2), but we additionally check if they are HTML pages and if so,
|
||||
we descend recursively into this link and check all the links in the
|
||||
HTML content.
|
||||
The --recursion-level option limits how deep such recursive
descents go.
|
||||
|
||||
LinkChecker provides four options that determine which of those three
categories a URL falls into: --intern, --extern, --strict and
--denyallow.
By default all URLs are internal. With --extern you specify which URLs
are external. With --intern you specify which URLs are internal.
Now imagine you have both --extern and --intern. What happens
when a URL matches both patterns? Or when it matches none? In this
situation the --denyallow option specifies the order in which we match
the URL. By default the order is internal/external, with --denyallow it is
external/internal. Either way, the first match counts, and if none matches,
the last checked category is the category for the URL.
|
||||
Finally, with --strict all external URLs are strict.
|
||||
|
||||
Oh, and just to boggle your mind: you can have more than one external
|
||||
regular expression in a config file and for each of those expressions
|
||||
you can specify if those matched external URLs should be strict or not.
|
||||
|
||||
An example. Assume we want to check only URLs of our domains named
'mydomain.com' and 'myotherdomain.com'. Then we specify
-i'^http://my(other)?domain\.com' as the internal regular expression; all other
URLs are treated as external. Easy.

Another example. We don't want to check mailto URLs. Then it's
-i'!^mailto:'. The '!' negates an expression. With --strict, we don't
even connect to any mail hosts.
|
||||
|
||||
Yet another example. We check our site www.mycompany.com, don't recurse
into external links pointing outside of our site, and want to ignore links to
hollowood.com and hullabulla.com completely.
This can only be done with a configuration entry like
  [filtering]
  extern1=hollowood.com 1
  extern2=hullabulla.com 1
  # the 1 means strict external, i.e. don't even connect
and the command
  linkchecker --intern=www.mycompany.com www.mycompany.com
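The matching rules from this answer can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not LinkChecker's actual code: the function name classify and its parameters are made up for the example, and the --strict distinction is left out.

```python
import re

def classify(url, intern=(), extern=(), denyallow=False):
    """Return 'internal' or 'external' for a URL (hypothetical helper)."""
    if not intern and not extern:
        # No patterns given: by default all URLs are internal.
        return "internal"
    # Default order is internal/external; --denyallow reverses it.
    checks = [("internal", intern), ("external", extern)]
    if denyallow:
        checks.reverse()
    for category, patterns in checks:
        for pattern in patterns:
            # A leading '!' negates the expression.
            negate = pattern.startswith("!")
            regex = pattern[1:] if negate else pattern
            matched = re.search(regex, url) is not None
            if matched != negate:
                return category  # the first match counts
    # Nothing matched: the last checked category wins.
    return checks[-1][0]

print(classify("http://mydomain.com/a", intern=[r"^http://my(other)?domain\.com"]))
print(classify("http://example.org/", intern=[r"^http://my(other)?domain\.com"]))
print(classify("mailto:john@company.com", intern=["!^mailto:"]))
```

With only an internal pattern given, every URL that does not match it falls through to the last checked category and is treated as external, which is the behaviour the first two examples rely on.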
|
||||
|
||||
|
||||
Q8: Is LinkChecker's cookie feature insecure?
A8: Cookies cannot store more information than is in the HTTP request itself,
so you are not giving away any more system information.
After storing, however, the cookies are sent back to the server on request.
Not to every server, but only to the one the cookie originated from!
This could be used to "track" subsequent requests to this server,
and this is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost. So the tracking is restricted to the checking time.
The cookie feature is disabled by default.
|
||||
|
||||
|
||||
Q9: I want to have my own logging class. How can I use it in LinkChecker?
|
||||
A9: Currently, only the Python API lets you define new logging classes.
Define your own logging class as a subclass of StandardLogger or any other
logging class in the log module.
Then call the addLogger function in Config.Configuration to register
your new logger.
After this, append a new logger instance to the fileoutput list.
|
||||
|
||||
  import linkcheck, MyLogger
  # name under which the new logger class is registered
  log_format = 'mylog'
  log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
  cfg = linkcheck.Config.Configuration()
  # register the class, then add an instance to the file output loggers
  cfg.addLogger(log_format, MyLogger.MyLogger)
  cfg['fileoutput'].append(cfg.newLogger(log_format, log_args))
|
||||
|
||||
|
||||
Q10.1: LinkChecker does not ignore anchor references on caching.
|
||||
Q10.2: Some links with anchors are getting checked twice.
|
||||
A10: This is not a bug.
It is commonly assumed that if a URL ABC#anchor1 works then
ABC#anchor2 works too. That is not specified anywhere and I have seen
server-side scripts that fail on some anchors and not on others.
This is the reason for always checking URLs with different anchors.
If you really want to disable this, use --no-anchor-caching.
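To make the difference concrete, here is a small sketch of how a cache key could be built with and without the anchor. The function cache_key and its keep_anchor parameter are hypothetical names for this illustration; the real caching code may look different.

```python
from urllib.parse import urldefrag

def cache_key(url, keep_anchor=True):
    """Return the key under which a checked URL would be remembered.

    keep_anchor=True:  the anchor is part of the key (the default described above).
    keep_anchor=False: the anchor is dropped, as with --no-anchor-caching.
    """
    if keep_anchor:
        return url
    base, _fragment = urldefrag(url)
    return base

urls = ["http://example.com/ABC#anchor1", "http://example.com/ABC#anchor2"]
print(len({cache_key(u) for u in urls}))                     # 2
print(len({cache_key(u, keep_anchor=False) for u in urls}))  # 1
```

Two distinct keys mean two checks; one shared key means the second URL is answered from the cache.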
|
||||
|
||||
|
||||
Q11: I see LinkChecker gets a "/robots.txt" file for every site it
|
||||
checks. What is that about?
|
||||
A11: LinkChecker follows the robots.txt exclusion standard. To avoid
|
||||
misuse of LinkChecker, you cannot turn this feature off.
|
||||
See http://www.robotstxt.org/wc/robots.html and
|
||||
http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
|
||||
for more info.
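To see what such an exclusion check looks like, here is a sketch using the robots.txt parser from Python's standard library. The rules and URLs are made up, and the rules are parsed from memory so that no network access is needed; this is not a claim about how LinkChecker implements the check.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that keeps all robots out of /private/.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]
parser = RobotFileParser()
parser.parse(rules)

# can_fetch() answers whether the named robot may retrieve the given URL.
allowed = parser.can_fetch("LinkChecker", "http://www.example.com/index.html")
blocked = parser.can_fetch("LinkChecker", "http://www.example.com/private/page.html")
print(allowed, blocked)
```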
|
||||
|
||||
|
||||
Q12: Ctrl-C does not stop LinkChecker immediately. Why is that so?
|
||||
A12: The Python interpreter has to wait for all threads to finish, and
|
||||
this means waiting for all open sockets to close. The default timeout
|
||||
for sockets is 30 seconds, hence the delay.
|
||||
You can change the default socket timeout with the --timeout option.
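For readers unfamiliar with socket timeouts, the following sketch shows how a process-wide default can be set in Python. It only illustrates the mechanism; that the --timeout option works exactly this way is an assumption, and the value 10 is arbitrary.

```python
import socket

# 30 seconds is the default mentioned above; a smaller value shortens the wait.
socket.setdefaulttimeout(10)

# Socket objects created afterwards inherit the default timeout.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
timeout = sock.gettimeout()
sock.close()
print(timeout)
```
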
|
||||
This is a list of things LinkChecker will *not* do for you.
|
||||
|
||||
1) Support JavaScript
|
||||
See the FAQ, question Q5.
|
||||
|
||||
2) Print unreachable/dead documents of your website.
|
||||
This would require
|
||||
- file system access to your web repository
|
||||
- access to your web server configuration
|
||||
|
||||
You can instead store the linkchecker results in a database
|
||||
and look for missing files.
|
||||
|
||||
3) HTML/XML syntax checking
|
||||
Use the HTML tidy program from http://tidy.sourceforge.net/ .
|
||||
|
||||
113
doc/index.txt
Normal file
|
|
@ -0,0 +1,113 @@
|
|||
LinkChecker
|
||||
=============
|
||||
|
||||
LinkChecker checks HTML documents for broken links.
|
||||
|
||||
It features
|
||||
o recursive checking
|
||||
o multithreading
|
||||
o output in colored or normal text, HTML, SQL, CSV or a sitemap
|
||||
graph in GML or XML.
|
||||
o HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Gopher, Telnet and local
|
||||
file links support
|
||||
o restriction of link checking with regular expression filters for URLs
|
||||
o proxy support
|
||||
o username/password authorization for HTTP and FTP
|
||||
o robots.txt exclusion protocol support
|
||||
o i18n support
|
||||
o a command line interface
|
||||
o a (Fast)CGI web interface (requires HTTP server)
|
||||
|
||||
|
||||
Installing and Requirements
|
||||
---------------------------
|
||||
Read the file INSTALL.
|
||||
|
||||
|
||||
Running the program
|
||||
-------------------
|
||||
o Unix or Mac OS X platforms
|
||||
The local configuration file is $HOME/.linkcheckerrc
|
||||
Type "linkchecker" followed by the URLs you want to check.
|
||||
Type "linkchecker -h" for help.
|
||||
|
||||
o Windows platforms
|
||||
Double-click on "linkchecker.bat" on your desktop.
|
||||
URL input is interactive.
|
||||
Another way is executing "python.exe linkchecker" in the Python
|
||||
Scripts directory.
|
||||
|
||||
o Mac OS 9.x platforms
|
||||
Read the MacOS Python documentation to find out about passing
|
||||
commandline options to Python scripts.
|
||||
|
||||
|
||||
License and Credits
|
||||
-------------------
|
||||
LinkChecker is licensed under the GNU General Public License.
|
||||
Credits go to Guido van Rossum and his team for making Python.
|
||||
His hovercraft is full of eels!
|
||||
As this program is directly derived from my Java link checker, additional
|
||||
credits go to Robert Forsman (the author of JCheckLinks) for his
robots.txt parsing algorithm.
|
||||
Nicolas Chauvat <Nicolas.Chauvat@logilab.fr> supplied a patch for
|
||||
an XML output logger.
|
||||
I want to thank everybody who gave me feedback, bug reports and
|
||||
suggestions.
|
||||
|
||||
|
||||
Versioning
|
||||
----------
|
||||
Version numbers have the same meaning as Linux Kernel version numbers.
|
||||
The first number is the major package version. The second number is
|
||||
the minor package version. An odd second number stands for development
versions, an even number for stable versions. The third number is a
package release sequence number.
|
||||
So for example 1.1.5 is the fifth release of the 1.1 development package.
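The scheme is easy to apply mechanically. A small sketch (describe_version is a hypothetical helper, not part of LinkChecker):

```python
def describe_version(version):
    """Split a version string and classify it by its second number."""
    major, minor, release = (int(part) for part in version.split("."))
    # Odd minor numbers are development versions, even ones are stable.
    kind = "development" if minor % 2 else "stable"
    return major, minor, release, kind

print(describe_version("1.1.5"))  # (1, 1, 5, 'development')
```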
|
||||
|
||||
|
||||
Included packages
|
||||
-----------------
|
||||
fcgi.py and sz_fcgi.py from Andreas Jung (http://www.andreas-jung.com/privat.html)
|
||||
Note that included packages are modified by me.
|
||||
|
||||
|
||||
Internationalization
|
||||
--------------------
|
||||
For German output execute "export LC_MESSAGES=de" in bash or
|
||||
"setenv LC_MESSAGES de" in tcsh.
|
||||
Under Windows, execute "set LC_MESSAGES=de".
|
||||
Other supported languages are 'nl' (Nederlands) and 'fr' (français).
|
||||
If you want to help me translate LinkChecker, copy the linkchecker.pot
|
||||
file to <your language>.po and send me the translated file.
|
||||
|
||||
|
||||
Code design
|
||||
-----------
|
||||
Only if you want to hack on the code.
|
||||
|
||||
(1) Look at the linkchecker script. This thing just reads all the
|
||||
commandline options and stores them in a Config object.
|
||||
|
||||
(2) Which leads us directly to the Config class. This class stores all
|
||||
options and supports threading and reading config files.
|
||||
A Config object reads config file options on initialization so they get
|
||||
handled before any commandline options.
|
||||
|
||||
(3) The linkchecker script calls linkcheck.checkUrls(), which
|
||||
calls linkcheck.Config.checkUrl(), which calls linkcheck.UrlData.check().
|
||||
A UrlData object represents a single URL with all attached data like
|
||||
validity, check time and so on. These values are filled by the
|
||||
UrlData.check() function.
|
||||
Derived from the base class UrlData are the different URL types:
|
||||
HttpUrlData for http:// links, MailtoUrlData for mailto: links, etc.
|
||||
|
||||
UrlData defines the functions which are common for *all* URLs, and
|
||||
the subclasses define functions needed for their URL type.
|
||||
|
||||
(4) Let's look at the output. Every output is defined in a Logger class.
|
||||
Each logger has functions init(), newUrl() and endOfOutput().
|
||||
We call init() once to initialize the Logger. UrlData.check() calls
|
||||
newUrl() (through UrlData.logMe()) for each new URL and after all
|
||||
checking is done we call endOfOutput(). Easy.
|
||||
New loggers are created with the Config.newLogger function.
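The call sequence from (4) can be sketched as follows. SketchLogger and run_check are stand-ins written for this illustration: the real loggers live in the log module, and the real newUrl() receives a UrlData object rather than the simplified (url, valid) pair assumed here.

```python
class SketchLogger:
    """Stand-in with the three functions named above; not the real base class."""

    def __init__(self):
        self.lines = []

    def init(self):
        # Called once before any URL is checked.
        self.lines.append("start of output")

    def newUrl(self, url, valid):
        # Called once per checked URL (simplified signature).
        self.lines.append("%s %s" % ("ok" if valid else "error", url))

    def endOfOutput(self):
        # Called once after all checking is done.
        self.lines.append("end of output")


def run_check(logger, results):
    """Drive a logger through init(), newUrl() per URL, and endOfOutput()."""
    logger.init()
    for url, valid in results:
        logger.newUrl(url, valid)
    logger.endOfOutput()


logger = SketchLogger()
run_check(logger, [("http://www.example.com/", True),
                   ("http://www.example.com/missing", False)])
print("\n".join(logger.lines))
```

A new output format then only needs to provide its own versions of these three functions.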
|
||||
92
doc/install.txt
Normal file
|
|
@ -0,0 +1,92 @@
|
|||
Installation
|
||||
==============
|
||||
|
||||
Requirements for Unix/Linux or Mac OS X
|
||||
--------------------------------------
|
||||
1. You need a standard GNU development environment with
|
||||
a) C compiler (for example the GNU C Compiler gcc)
|
||||
b) gettext
|
||||
2. Python >= 2.3 from http://www.python.org/ with zlib support
|
||||
|
||||
|
||||
Requirements for Windows
|
||||
------------------------
|
||||
Direct download links are in brackets.
|
||||
1. Install the MinGW suite from http://mingw.sourceforge.net.
|
||||
Be sure to install in the given order:
|
||||
a) MinGW
|
||||
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/MinGW-3.1.0-1.exe ]
|
||||
b) MSYS
|
||||
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/MSYS-1.0.10.exe ]
|
||||
c) libiconv
|
||||
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/libiconv-1.8.0-2003.02.01-1.exe ]
|
||||
d) gettext
|
||||
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/gettext-0.11.5-2003.02.01-1.exe ]
|
||||
2. Install Python >= 2.3 from http://www.python.org/
|
||||
[ http://www.python.org/ftp/python/2.3.3/Python-2.3.3.exe ]
|
||||
|
||||
|
||||
Setup for Unix/Linux or Mac OS X
|
||||
-------------------------------
|
||||
0. Help note
|
||||
Run "python setup.py --help" for help about command options
|
||||
1. Compile Python modules
|
||||
Run "python setup.py build" to compile the Python files.
|
||||
The CC environment variable is checked before compilation, so you can
|
||||
change the default C compiler with "export CC=myccompiler".
|
||||
2. Install Python modules
|
||||
Run "python setup.py install" to install everything.
|
||||
|
||||
|
||||
Setup for Windows
|
||||
-----------------
|
||||
0. Install check
|
||||
Be sure to have installed all required software listed above.
|
||||
1. Preparing Python for the MinGW compiler
|
||||
Search for the file python23.dll in your Windows folder.
After you have found it, launch MSYS. Then change into your Windows folder,
for example with:
  % cd /c/winnt/system32
|
||||
After that execute:
|
||||
% pexports python23.dll > python23.def
|
||||
Then use the dlltool
|
||||
% dlltool --dllname python23.dll --def python23.def --output-lib libpython23.a
|
||||
The resulting library has to be placed in the same directory as
|
||||
python23.lib. (Should be the libs directory under your Python installation
|
||||
directory.)
|
||||
2. Compile gettext translations
|
||||
Change to the linkchecker-X.X.X/po directory:
|
||||
% make win
|
||||
3. Compile and install the Python modules
|
||||
Start a DOS command shell and change to the linkchecker-X.X.X directory:
|
||||
c:> python setup.py build -c mingw32 install
|
||||
|
||||
|
||||
Installation for other platforms
|
||||
--------------------------------
|
||||
If you happen to install LinkChecker on other platforms (for example
|
||||
Mac OS 9.x) then drop me a note.
|
||||
|
||||
|
||||
(Fast)CGI web interface
|
||||
-----------------------
|
||||
The *cgi files are three CGI scripts which you can use to run LinkChecker
|
||||
with a nice graphical web interface.
|
||||
You can use and adjust the example HTML files in the lconline directory
|
||||
to run the script.
|
||||
1. Choose a CGI script. The simplest is lc.cgi and you need a web server
|
||||
with CGI support.
|
||||
The scripts lc.fcgi (I tested this a while ago) and lc.sz_fcgi
|
||||
(untested) need a web server with FastCGI support.
|
||||
2. Copy the script of your choice into the CGI directory.
|
||||
3. Adjust the "action=..." parameter in lconline/lc_cgi.html
|
||||
to point to your CGI script.
|
||||
4. Load the lconline/index.html file, enter a URL and click on the
check button.
|
||||
5. If something goes wrong, check the following:
|
||||
a) look in the error log of your web server
|
||||
b) be sure that you have enabled CGI support in your web server;
verify this by running other CGI scripts that you know
are working
|
||||
c) try to run the lc.cgi script by hand
|
||||
d) try the testit() function in the lc.cgi script
|
||||