git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1415 e7d03fd6-7b0d-0410-9947-9c21f3af8025
calvin 2004-08-16 19:09:37 +00:00
parent e49c6f2fcb
commit 0c1a7054d8
3 changed files with 373 additions and 0 deletions

doc/faq.txt (new file)

@@ -0,0 +1,168 @@
Q1: LinkChecker produced an error, but my web page is ok with
Netscape/IE/Opera/...
Is this a bug in LinkChecker?
A1: Please check your web pages first. Are they really ok? Use
a syntax-highlighting editor! Use HTML Tidy from http://tidy.sourceforge.net/!
Check whether you are using a proxy that produces the error.
Q2: I still get an error, but the page is definitely ok.
A2: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy of the webmaster running the website you are checking.
A website might even send robots different web pages than it
sends to normal browsers.
Q3: How can I tell LinkChecker which proxy to use?
A3: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy
environment variables to a URL that identifies the proxy server before
starting LinkChecker. For example:
# http_proxy="http://www.someproxy.com:3128"
# export http_proxy
In a Macintosh environment, LinkChecker will retrieve proxy information
from Internet Config.
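As a sanity check, Python's urllib reads these same environment variables; a small sketch (the proxy URL is a placeholder, and LinkChecker of this vintage ran on Python 2, so this is purely illustrative):

```python
import os
import urllib.request

# Set the proxy the same way the FAQ describes (normally done in the
# shell before starting LinkChecker); the URL here is a placeholder.
os.environ["http_proxy"] = "http://www.someproxy.com:3128"

# urllib consults the *_proxy environment variables, which is why
# Python-based tools pick up proxies transparently.
proxies = urllib.request.getproxies()
print(proxies["http"])  # -> http://www.someproxy.com:3128
```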
Q4: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.
A4: You have to quote special characters (e.g. spaces) in the subject field.
The correct link should be "mailto:...?subject=Hello%20John"
Unfortunately browsers like IE and Netscape do not enforce this.
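The quoting can be done programmatically; a small illustration with Python's standard urllib.parse (not part of LinkChecker itself):

```python
from urllib.parse import quote

subject = "Hello John"
# Percent-encode special characters (here the space) for use in a URL.
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)  # mailto:john@company.com?subject=Hello%20John
```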
Q5: Has LinkChecker JavaScript support?
A5: No, and it never will. If your page does not work without JavaScript,
then your web design is broken.
Use PHP, Zope or ASP for dynamic content, and use JavaScript only as
an add-on for your web pages.
Q6: I have a pretty large site to check. How can I restrict link checking
to check only my own pages?
A6: Look at the options --intern, --extern, --strict, --denyallow and
--recursion-level.
Q7: I don't get this --extern/--intern stuff.
A7: When it comes to checking, there are three types of URLs:
1) strict external URLs:
We only do syntax checking. Internal URLs are never strict.
2) external URLs:
Like 1), but we additionally check if they are valid by connect()ing
to them.
3) internal URLs:
Like 2), but we additionally check if they are HTML pages and, if so,
we descend recursively into the link and check all the links in the
HTML content.
The --recursion-level option restricts the number of such recursive
descents.
LinkChecker provides four options which determine into which of these
three categories a URL falls: --intern, --extern, --strict and
--denyallow.
By default all URLs are internal. With --extern you specify which URLs
are external; with --intern you specify which URLs are internal.
Now imagine you give both --extern and --intern. What happens
when a URL matches both patterns? Or when it matches none? In this
situation the --denyallow option specifies the order in which we match
the URL. By default the order is internal/external; with --denyallow it is
external/internal. Either way, the first match counts, and if nothing
matches, the last checked category becomes the URL's category.
Finally, with --strict all external URLs are strict.
Oh, and just to boggle your mind: you can have more than one external
regular expression in a config file and for each of those expressions
you can specify if those matched external URLs should be strict or not.
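The matching order above can be sketched in a few lines of Python; this is an illustrative model, not LinkChecker's actual code:

```python
import re

# A minimal sketch of the matching order described above: the first
# matching pattern wins, and if nothing matches, the last category
# checked becomes the URL's category.
def classify(url, intern_patterns, extern_patterns, denyallow=False):
    # Default order is internal/external; --denyallow flips it.
    if denyallow:
        order = [("extern", extern_patterns), ("intern", intern_patterns)]
    else:
        order = [("intern", intern_patterns), ("extern", extern_patterns)]
    category = None
    for name, patterns in order:
        category = name  # remember the last checked category as fallback
        if any(re.search(p, url) for p in patterns):
            return name
    return category

print(classify("http://mydomain.com/x",
               [r"^http://my(other)?domain\.com"], []))  # intern
```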
An example. Assume we want to check only URLs of our domains
'mydomain.com' and 'myotherdomain.com'. Then we specify
-i'^http://my(other)?domain\.com' as the internal regular expression; all
other URLs are treated as external. Easy.
Another example. We don't want to check mailto URLs. Then it is
-i'!^mailto:'. The '!' negates an expression. With --strict, we don't
even connect to any mail hosts.
Yet another example. We check our site www.mycompany.com, don't want to
recurse into external links pointing outside our site, and want to ignore
links to hollowood.com and hullabulla.com completely.
This can only be done with a configuration entry like
[filtering]
extern1=hollowood.com 1
extern2=hullabulla.com 1
# the 1 means strict external, i.e. don't even connect
and the command
linkchecker --intern=www.mycompany.com www.mycompany.com
Q8: Is LinkChecker's cookie feature insecure?
A8: Cookies cannot store more information than is in the HTTP request itself,
so you are not giving away any additional system information.
After being stored, however, the cookies are sent back to the server on
subsequent requests. Not to every server, but only to the one the cookie
originated from! This can be used to "track" subsequent requests to that
server, and this is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost, so the tracking is restricted to the checking time.
The cookie feature is disabled by default.
Q9: I want to have my own logging class. How can I use it in LinkChecker?
A9: Currently, only the Python API lets you define new logging classes.
Define your own logging class as a subclass of StandardLogger or any other
logging class in the log module.
Then call the addLogger function of Config.Configuration to register
your new logger.
After that, append a new logger instance to the file output list:
import linkcheck, MyLogger
log_format = 'mylog'
log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
cfg = linkcheck.Config.Configuration()
cfg.addLogger(log_format, MyLogger.MyLogger)
cfg['fileoutput'].append(cfg.newLogger(log_format, log_args))
Q10.1: LinkChecker does not ignore anchor references when caching.
Q10.2: Some links with anchors are getting checked twice.
A10: This is not a bug.
It is commonly believed that if a URL ABC#anchor1 works, then
ABC#anchor2 works too. That is not specified anywhere, and I have seen
server-side scripts that fail on some anchors and not on others.
This is the reason for always checking URLs with different anchors.
If you really want to disable this, use --no-anchor-caching.
Q11: I see LinkChecker gets a "/robots.txt" file for every site it
checks. What is that about?
A11: LinkChecker follows the robots.txt exclusion standard. To avoid
misuse of LinkChecker, you cannot turn this feature off.
See http://www.robotstxt.org/wc/robots.html and
http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
for more info.
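The exclusion standard itself is easy to experiment with; a sketch using Python's standard robots.txt parser (not LinkChecker's own implementation):

```python
from urllib.robotparser import RobotFileParser

# Parse a small robots.txt directly instead of fetching one over HTTP.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
# Disallowed and allowed paths, checked for a robot named "LinkChecker":
print(rp.can_fetch("LinkChecker", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("LinkChecker", "http://example.com/public/page.html"))   # True
```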
Q12: Ctrl-C does not stop LinkChecker immediately. Why is that so?
A12: The Python interpreter has to wait for all threads to finish, and
this means waiting for all open sockets to close. The default timeout
for sockets is 30 seconds, hence the delay.
You can change the default socket timeout with the --timeout option.
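For illustration, Python exposes the same idea as an interpreter-wide default timeout for new sockets; a sketch, not necessarily the exact mechanism LinkChecker uses internally:

```python
import socket

# Set a global default timeout (in seconds) for newly created sockets;
# the 30-second figure matches the FAQ's default.
socket.setdefaulttimeout(30)
print(socket.getdefaulttimeout())
```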
This is a list of things LinkChecker will *not* do for you.
1) Support JavaScript
See the FAQ, question Q5.
2) Print unreachable/dead documents of your website.
This would require
- file system access to your web repository
- access to your web server configuration
You can instead store the linkchecker results in a database
and look for missing files.
3) HTML/XML syntax checking
Use the HTML Tidy program from http://tidy.sourceforge.net/.

doc/index.txt (new file)

@@ -0,0 +1,113 @@
LinkChecker
===========
LinkChecker checks HTML documents for broken links.
It features
o recursive checking
o multithreading
o output in colored or normal text, HTML, SQL, CSV or a sitemap
graph in GML or XML.
o HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Gopher, Telnet and local
file links support
o restriction of link checking with regular expression filters for URLs
o proxy support
o username/password authorization for HTTP and FTP
o robots.txt exclusion protocol support
o i18n support
o a command line interface
o a (Fast)CGI web interface (requires HTTP server)
Installing and Requirements
---------------------------
Read the file INSTALL.
Running the program
-------------------
o Unix or Mac OS X platforms
The local configuration file is $HOME/.linkcheckerrc
Type "linkchecker" followed by the URLs you want to check.
Type "linkchecker -h" for help.
o Windows platforms
Double-click on "linkchecker.bat" on your desktop.
URL input is interactive.
Another way is to execute "python.exe linkchecker" in the Python
Scripts directory.
o Mac OS 9.x platforms
Read the MacOS Python documentation to find out about passing
commandline options to Python scripts.
License and Credits
-------------------
LinkChecker is licensed under the GNU General Public License (GPL).
Credits go to Guido van Rossum and his team for making Python.
His hovercraft is full of eels!
As this program is directly derived from my Java link checker, additional
credits go to Robert Forsman (the author of JCheckLinks) and his
robots.txt parsing algorithm.
Nicolas Chauvat <Nicolas.Chauvat@logilab.fr> supplied a patch for
an XML output logger.
I want to thank everybody who gave me feedback, bug reports and
suggestions.
Versioning
----------
Version numbers have the same meaning as Linux kernel version numbers.
The first number is the major package version. The second number is
the minor package version: an odd second number stands for development
versions, an even one for stable versions. The third number is a
package release sequence number.
So for example 1.1.5 is the fifth release of the 1.1 development series.
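The odd/even rule can be restated as a few lines of Python; the helper name and version strings are made up for illustration:

```python
# Odd minor number = development release, even = stable release,
# following the versioning scheme described above.
def release_kind(version):
    minor = int(version.split(".")[1])
    return "development" if minor % 2 else "stable"

print(release_kind("1.1.5"))  # development
print(release_kind("1.2.0"))  # stable
```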
Included packages
-----------------
fcgi.py and sz_fcgi.py from Andreas Jung (http://www.andreas-jung.com/privat.html)
Note that included packages are modified by me.
Internationalization
--------------------
For German output, execute "export LC_MESSAGES=de" in bash or
"setenv LC_MESSAGES de" in tcsh.
Under Windows, execute "set LC_MESSAGES=de".
Other supported languages are 'nl' (Dutch) and 'fr' (French).
If you want to help me translate LinkChecker, copy the linkchecker.pot
file to <your language>.po and send me the translated file.
Code design
-----------
Read this only if you want to hack on the code.
(1) Look at the linkchecker script. This thing just reads all the
commandline options and stores them in a Config object.
(2) Which leads us directly to the Config class. This class stores all
options and supports threading and reading config files.
A Config object reads config file options on initialization so they get
handled before any commandline options.
(3) The linkchecker script calls linkcheck.checkUrls(), which
calls linkcheck.Config.checkUrl(), which calls linkcheck.UrlData.check().
A UrlData object represents a single URL with all attached data like
validity, check time and so on. These values are filled in by the
UrlData.check() function.
Derived from the base class UrlData are the different URL types:
HttpUrlData for http:// links, MailtoUrlData for mailto: links, etc.
UrlData defines the functions which are common for *all* URLs, and
the subclasses define functions needed for their URL type.
(4) Let's look at the output. Each output format is implemented by a Logger class.
Each logger has functions init(), newUrl() and endOfOutput().
We call init() once to initialize the Logger. UrlData.check() calls
newUrl() (through UrlData.logMe()) for each new URL and after all
checking is done we call endOfOutput(). Easy.
New loggers are created with the Config.newLogger function.
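The logger interface above can be sketched as follows; the class body is a hypothetical stand-in, only the method names come from the text:

```python
# A skeletal logger following the interface described above.
class ListLogger:
    def init(self):
        # called once to initialize the logger
        self.urls = []

    def newUrl(self, url_data):
        # called for each new URL (via UrlData.logMe() in LinkChecker)
        self.urls.append(url_data)

    def endOfOutput(self):
        # called after all checking is done
        return len(self.urls)

logger = ListLogger()
logger.init()
logger.newUrl("http://example.com/")
print(logger.endOfOutput())  # 1
```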

doc/install.txt (new file)

@@ -0,0 +1,92 @@
Installation
============
Requirements for Unix/Linux or Mac OS X
---------------------------------------
1. You need a standard GNU development environment with
a) C compiler (for example the GNU C Compiler gcc)
b) gettext
2. Python >= 2.3 from http://www.python.org/ with zlib support
Requirements for Windows
------------------------
Direct download links are in brackets.
1. Install the MinGW suite from http://mingw.sourceforge.net.
Be sure to install in the given order:
a) MinGW
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/MinGW-3.1.0-1.exe ]
b) MSYS
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/MSYS-1.0.10.exe ]
c) libiconv
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/libiconv-1.8.0-2003.02.01-1.exe ]
d) gettext
[ http://osdn.dl.sourceforge.net/sourceforge/mingw/gettext-0.11.5-2003.02.01-1.exe ]
2. Install Python >= 2.3 from http://www.python.org/
[ http://www.python.org/ftp/python/2.3.3/Python-2.3.3.exe ]
Setup for Unix/Linux or Mac OS X
--------------------------------
0. Help note
Run "python setup.py --help" for help about command options
1. Compile Python modules
Run "python setup.py build" to compile the Python files.
The CC environment variable is checked before compilation, so you can
change the default C compiler with "export CC=myccompiler".
2. Install Python modules
Run "python setup.py install" to install everything.
Setup for Windows
-----------------
0. Install check
Be sure to have installed all required software listed above.
1. Preparing Python for the MinGW compiler
Search for the file python23.dll in your Windows folder.
Once you have found it, launch MSYS. Then change into your Windows folder,
for example with:
% cd c:\winnt\system32
After that execute:
% pexports python23.dll > python23.def
Then use dlltool:
% dlltool --dllname python23.dll --def python23.def --output-lib libpython23.a
The resulting library has to be placed in the same directory as
python23.lib. (Should be the libs directory under your Python installation
directory.)
2. Compile gettext translations
Change to the linkchecker-X.X.X/po directory and run:
% make win
3. Compile and install the Python modules
Start a DOS command shell and change to the linkchecker-X.X.X directory:
c:> python setup.py build -c mingw32 install
Installation for other platforms
--------------------------------
If you happen to install LinkChecker on other platforms (for example
Mac OS 9.x) then drop me a note.
(Fast)CGI web interface
-----------------------
The *cgi files are three CGI scripts which you can use to run LinkChecker
with a nice graphical web interface.
You can use and adjust the example HTML files in the lconline directory
to run the script.
1. Choose a CGI script. The simplest is lc.cgi; for it you need a web
server with CGI support.
The scripts lc.fcgi (which I tested a while ago) and lc.sz_fcgi
(untested) need a web server with FastCGI support.
2. Copy the script of your choice into the CGI directory.
3. Adjust the "action=..." parameter in lconline/lc_cgi.html
to point to your CGI script.
4. Load the lconline/index.html file, enter a URL and click on the
check button.
5. If something goes wrong, check the following:
a) Look in the error log of your web server.
b) Be sure that you have enabled CGI support in your web server;
verify this by running other CGI scripts that you know are working.
c) Try to run the lc.cgi script by hand.
d) Try the testit() function in the lc.cgi script.