diff --git a/doc/en/documentation.html b/doc/en/documentation.html new file mode 100644 index 00000000..3c10f0e3 --- /dev/null +++ b/doc/en/documentation.html @@ -0,0 +1,303 @@ + + + +
+ + +To check a URL like http://www.myhomepage.org/ it is enough to +execute linkchecker http://www.myhomepage.org/. This will check the +complete domain of www.myhomepage.org recursively. All links pointing +outside of the domain are also checked for validity.
+For more options, read the man page linkchecker(1) or execute +linkchecker -h.
+All URLs have to pass a preliminary syntax test. Minor quoting +mistakes issue a warning; all other syntax problems +are errors. +After the syntax check passes, the URL is queued for connection +checking. All connection check types are described below.
+HTTP links (http:, https:)
+After connecting to the given HTTP server the given path +or query is requested. All redirections are followed, and +if user/password is given it will be used as authorization +when necessary. +Permanently moved pages issue a warning. +All final HTTP status codes other than 2xx are errors.
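The HTTP rules above can be sketched as a small classification function. This is only an illustration, not LinkChecker's own code: the function name `classify_http_check` is hypothetical, and treating only status 301 as "permanently moved" is an assumption.

```python
def classify_http_check(status_codes):
    """Classify a chain of HTTP status codes that ends with the final response.

    Returns a (result, warnings) tuple, where result is "valid" or "error".
    """
    warnings = []
    # Every response before the last one is a redirection that was followed.
    for code in status_codes[:-1]:
        if code == 301:  # assumption: only 301 counts as "permanently moved"
            warnings.append("permanently moved")
    final = status_codes[-1]
    # Any final status outside the 2xx range is an error.
    result = "valid" if 200 <= final <= 299 else "error"
    return result, warnings
```

For example, a chain of `[301, 200]` is valid but carries a warning, while `[302, 404]` is an error.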
+Local files (file:)
+A regular, readable file that can be opened is valid. A readable +directory is also valid. All other files, for example device files, +unreadable or non-existing files are errors.
+File contents are checked for recursion.
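A minimal sketch of the validity rule for local files, assuming a plain file system path has already been extracted from the file: URL. The helper name `is_valid_local_path` is hypothetical and not part of LinkChecker.

```python
import os
import stat


def is_valid_local_path(path):
    """Return True for a readable regular file or a readable directory."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Non-existing (or otherwise inaccessible) paths are errors.
        return False
    # Device files, sockets and other special files are errors.
    if not (stat.S_ISREG(mode) or stat.S_ISDIR(mode)):
        return False
    return os.access(path, os.R_OK)
```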
+Mail links (mailto:)
+A mailto: link eventually resolves to a list of email addresses. +If one address fails, the whole list will fail. +For each mail address we check the following things:
+FTP links (ftp:)
+For FTP links we do:
+Gopher links (gopher:)
+We try to send the given selector (or query) to the gopher server.
+Telnet links (telnet:)
+We try to connect and if user/password are given, login to the +given telnet server.
+NNTP links (news:, snews:, nntp:)
+We try to connect to the given NNTP server. If a news group or +article is specified, try to request it from the server.
+Ignored links (javascript:, etc.)
+An ignored link will only print a warning. No further checking +will be made.
+Here is a complete list of recognized, but ignored links. The most +prominent of them should be JavaScript links.
+Recursion occurs on HTML files, Opera bookmark files and directories. +Note that the directory recursion reads all files in that +directory, not just a subset like index.htm*.
+Q: LinkChecker produced an error, but my web page is ok with +Netscape/IE/Opera/... +Is this a bug in LinkChecker?
+A: Please check your web pages first. Are they really ok? Use +a syntax highlighting editor. Use HTML Tidy. +Check if you are using a proxy which produces the error.
+Q: I still get an error, but the page is definitely ok.
+A: Some servers deny access to automated tools (also called robots) +like LinkChecker. This is not a bug in LinkChecker but rather a +policy by the webmaster running the website you are checking. +It might even be possible for a website to send robots different +web pages than normal browsers.
+Q: How can I tell LinkChecker which proxy to use?
+A: LinkChecker works transparently with proxies. In a Unix or Windows +environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy +environment variables to a URL that identifies the proxy server before +starting LinkChecker. For example
+
+$ http_proxy="http://www.someproxy.com:3128"
+$ export http_proxy
+
+
In a Macintosh environment, LinkChecker will retrieve proxy information +from Internet Config.
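To see which proxy a Python program picks up from these variables, the standard library can be queried directly. This sketch only shows how Python's own urllib reads the environment; that LinkChecker resolves proxies the same way is an assumption.

```python
import os
import urllib.request

# Set the variable for this process only (same value as in the shell example).
os.environ["http_proxy"] = "http://www.someproxy.com:3128"

# Collect all *_proxy environment variables, keyed by URL scheme.
proxies = urllib.request.getproxies_environment()
print(proxies.get("http"))
```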
+Q: The link "mailto:john@company.com?subject=Hello John" is reported +as an error.
+A: You have to quote special characters (e.g. spaces) in the subject field. +The correct link should be "mailto:...?subject=Hello%20John". +Unfortunately browsers like IE and Netscape do not enforce this.
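If the links are generated by a script, the quoting can be done with Python's standard library. A small sketch, using the address from the question:

```python
from urllib.parse import quote

subject = "Hello John"
# quote() percent-encodes the space as %20, as required in a mailto: URL.
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)
```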
+Q: Does LinkChecker support JavaScript?
+A: No, and it never will. If your page does not work without JS then your +web design is broken. +Use PHP or Zope or ASP for dynamic content, and use JavaScript just as +an add-on for your web pages.
+Q: I don't get this --extern/--intern stuff.
+A: When it comes to checking there are three types of URLs. Note +that local files are also represented as URLs (i.e. file://). So +local files can be external URLs.
+LinkChecker provides four options that determine which of those +three categories a URL falls into: --intern, --extern, --extern-strict-all and +--denyallow. +By default all URLs are internal. With --extern you specify which URLs +are external. With --intern you specify which URLs are internal. +Now imagine you have both --extern and --intern. What happens +when a URL matches both patterns? Or when it matches none? In this +situation the --denyallow option specifies the order in which we match +the URL. By default it is internal/external, with --denyallow the order is +external/internal. Either way, the first match counts, and if none matches, +the last checked category is the category for the URL. +Finally, with --extern-strict-all all external URLs are strict.
+Oh, and just to boggle your mind: you can have more than one external +regular expression in a config file and for each of those expressions +you can specify if those matched external URLs should be strict or not.
+An example. We don't want to check mailto URLs. Then it's +-i'!^mailto:'. The '!' negates an expression. With --extern-strict-all, +we don't even connect to any mail hosts.
+Another example. We check our site www.mycompany.com, don't recurse +into external links pointing outside of our site, and want to ignore links +to hollowood.com and hullabulla.com completely. +This can only be done with a configuration entry like
+
+[filtering]
+extern1=hollowood.com 1
+extern2=hullabulla.com 1
+# the 1 means strict external, i.e. don't even connect
+
+
and the command +linkchecker --intern=www.mycompany.com www.mycompany.com
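The matching order described in this answer can be sketched in a few lines of Python. This is an illustration of the rules as stated here, not LinkChecker's implementation: the function name `classify_url` and its parameters are hypothetical, and the '!' negation and the strict flag are left out.

```python
import re


def classify_url(url, intern=(), extern=(), denyallow=False):
    """Return "internal" or "external" for a URL.

    intern and extern are sequences of regular expressions.
    """
    # With no patterns at all, every URL is internal.
    if not intern and not extern:
        return "internal"
    # Default order is internal/external; --denyallow reverses it.
    checks = [("internal", intern), ("external", extern)]
    if denyallow:
        checks.reverse()
    # The first matching category wins.
    for category, patterns in checks:
        if any(re.search(pattern, url) for pattern in patterns):
            return category
    # No match: the last checked category applies.
    return checks[-1][0]
```

With `intern=["www.mycompany.com"]` a link to another host matches nothing and therefore ends up in the last checked category, which is external in the default order.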
+Q: Is LinkChecker's cookie feature insecure?
+A: Cookies cannot store more information than is in the HTTP request itself, +so you are not giving away any more system information. +After storing, however, the cookies are sent out to the server on request. +Not to every server, but only to the one the cookie originated from! +This could be used to "track" subsequent requests to this server, +and this is what annoys some people (including me). +Cookies are only stored in memory. After LinkChecker finishes, they +are lost. So the tracking is restricted to the checking time. +The cookie feature is disabled by default.
+Q: I want to have my own logging class. How can I use it in LinkChecker?
+A: Currently, only a Python API lets you define new logging classes. +Define your own logging class as a subclass of StandardLogger or any other +logging class in the log module. +Then call the logger_add function of linkcheck.configuration.Configuration to register +your new logger. +After this, append a new logger instance to the fileoutput list.
+
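+# MyLogger is your own module that defines the MyLogger class
+import linkcheck.configuration
+import MyLogger
+# name under which the new logger is registered
+log_format = 'mylog'
+log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
+cfg = linkcheck.configuration.Configuration()
+# register the logger class, then add an instance to the file outputs
+cfg.logger_add(log_format, MyLogger.MyLogger)
+cfg['fileoutput'].append(cfg.logger_new(log_format, log_args))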
+
+Q: LinkChecker does not ignore anchor references on caching.
+Q: Some links with anchors are getting checked twice.
+A: This is not a bug. +It is commonly assumed that if a URL ABC#anchor1 works then +ABC#anchor2 works too. That is not specified anywhere and I have seen +server-side scripts that fail on some anchors and not on others. +This is the reason for always checking URLs with different anchors. +If you really want to disable this, use the --no-anchor-caching +option.
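The effect on caching can be pictured as a choice of cache key. The sketch below is an assumption about the general idea, not LinkChecker's code; `cache_key` and `keep_anchor` are hypothetical names.

```python
from urllib.parse import urldefrag


def cache_key(url, keep_anchor=True):
    """Return the key under which a check result would be cached."""
    if keep_anchor:
        # Default: ABC#anchor1 and ABC#anchor2 are cached (and checked) separately.
        return url
    # Like --no-anchor-caching: drop the anchor so both share one entry.
    return urldefrag(url).url
```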
+Q: I see LinkChecker gets a /robots.txt file for every site it +checks. What is that about?
+A: LinkChecker follows the robots.txt exclusion standard. To avoid +misuse of LinkChecker, you cannot turn this feature off. +See the Web Robot pages and the Spidering report for more info.
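To see how a robots.txt file is interpreted, Python's standard robot parser can be fed the rules directly. The rules and the example.com URLs below are made up for illustration, and using "LinkChecker" as the user agent name is an assumption.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that closes one directory to all robots.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether a robot with the given name may fetch each URL.
allowed = parser.can_fetch("LinkChecker", "http://www.example.com/index.html")
blocked = parser.can_fetch("LinkChecker", "http://www.example.com/private/a.html")
```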
+Q: Ctrl-C does not stop LinkChecker immediately. Why is that so?
+A: The Python interpreter has to wait for all threads to finish, and +this means waiting for all open sockets to close. The default timeout +for sockets is 30 seconds, hence the delay. +You can change the default socket timeout with the --timeout option.
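In plain Python the default socket timeout is a process-wide setting. The sketch below shows that mechanism with an arbitrary value of 10 seconds; that the --timeout option is wired to exactly this call is an assumption.

```python
import socket

# Every socket created after this call inherits a 10 second timeout.
socket.setdefaulttimeout(10)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
timeout = sock.gettimeout()
sock.close()
```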
+Q: How do I print unreachable/dead documents of my website with +LinkChecker?
+A: No can do. This would require file system access to your web +repository and access to your web server configuration.
+You can instead store the linkchecker results in a database +and look for missing files.
+Q: How do I check HTML/XML syntax with LinkChecker?
+A: No can do. Use the HTML Tidy program.
+Download the latest packages from the LinkChecker download section. +MD5 checksums of the above files are also available.
+Requirements and installation instructions are located at the +install documentation. To see what has changed between releases +look at the ChangeLog.
+[Figure: Commandline interface; Web interface]
+
The local configuration file is $HOME/.linkcheckerrc +Type "linkchecker" followed by the URLs you want to check. +Type "linkchecker -h" for help.
+Start "Check URL" in your LinkChecker program group. +URL input is interactive. +Another way is executing "python.exe linkchecker" in the Python +Scripts directory.
+Read the MacOS Python documentation to find out about passing +commandline options to Python scripts.
+For German output execute "export LC_MESSAGES=de" in bash or +"setenv LC_MESSAGES de" in tcsh. +Under Windows, execute "set LC_MESSAGES=de". +Other supported languages are 'nl' (Nederlands) and 'fr' (français).
+You can help to translate LinkChecker by copying the included +linkchecker.pot file to language.po, translating it and +sending it to me.
+The SourceForge Bug interface lets you submit bugs, patches +and requests.
+The SourceForge CVS page has all the information on how to +obtain the development version of LinkChecker. Development of +LinkChecker requires some more software to be available, which +is documented on the installation page.
+ +If you are upgrading from older versions of LinkChecker you should +also read the upgrading documentation.
+You need a standard GNU development environment with
+C compiler (for example the GNU C Compiler gcc)
+Depending on your distribution, several development packages +might be needed to provide a fully functional C development +environment.
+Note for developers: if you want to regenerate the po/linkchecker.pot +template from the source files, you will need xgettext with Python +support. This is available in gettext >= 0.12.
+Python >= 2.4 from http://www.python.org/ with zlib support
+Be sure to also have installed the included distutils module. +On most distributions, the distutils module is included in +an extra "python-dev" package.
+Optional, for bash-completion: +optcomplete Python module from http://furius.ca/optcomplete/
+Optional (speedup for i386 compatible PCs) +Psyco from http://psyco.sourceforge.net/ +[http://osdn.dl.sourceforge.net/sourceforge/psyco/psyco-1.4-src.tar.gz]
+Install check
+Be sure to have installed all required Unix/Linux software listed above.
+Compile Python modules
+Run python setup.py build to compile the Python files. +For help about the setup.py script options, run +python setup.py --help. +The CC environment variable is checked before compilation, so you can +change the default C compiler with export CC=myccompiler.
++++
+- +
Installation as root
+Run su -c 'python setup.py install' to install LinkChecker.
+- +
Installation as a normal user
+Run python setup.py install --home $HOME. Note that you have +to adjust your PATH and PYTHONPATH environment variables, e.g. by +adding the commands export PYTHONPATH=$HOME/lib/python and +export PATH=$PATH:$HOME/bin to your shell configuration +file.
+For more information look at the Modifying Python's search path +documentation.
+If you downloaded Psyco please read the psyco installation docs.
+
Install check
+Be sure to have installed all required Windows software listed above.
+Execute the linkchecker-x.xx.win32-py2.4.exe file and follow +the instructions.
+Install check
+Be sure to have installed all required Unix/Linux software listed above.
+Preparing Python for the MinGW compiler
+Search for the file python24.dll in your Windows folder. +After you have found it, launch MSYS. Change into the Windows folder, +for example cd c:\winnt\system32. Then execute +pexports python24.dll > python24.def. +Then run dlltool with +dlltool --dllname python24.dll --def python24.def --output-lib +libpython24.a. +The resulting library has to be placed in the same directory as +python24.lib. (Should be the libs directory under your Python installation +directory, for example c:\Python24\Libs\.)
+Generate and execute the LinkChecker installer
+Close the MSYS application (by typing exit) and open a DOS command +prompt. +Change to the linkchecker-X.X.X directory and run +python setup.py build -c mingw32 bdist_wininst.
+This generates a binary installer +dist\linkchecker-X.X.X.win32-py2.4.exe which you just have to +execute.
+If you downloaded Psyco please read the psyco installation docs.
+LinkChecker is now installed. Have fun! +See the main page on how to configure and start LinkChecker.
+If you happen to install LinkChecker on other platforms (for example +Mac OS 9.x) then drop me a note.
+The included CGI scripts can run LinkChecker with a nice graphical web +interface. +You can use and adjust the example HTML files in the lconline directory +to run the script.
+If LinkChecker does not fit your requirements, you can check out the +competition. All of these programs also have an Open Source license +like LinkChecker.
+The per-user config file is now ~/.linkchecker/linkcheckerrc +(previous location was ~/.linkcheckerrc ).
+The default blacklist output file is now ~/.linkchecker/blacklist +(previous location was ~/.blacklist).
+Python >= 2.4 is now required.
+The --output and --file-output parameters can now specify the +encoding. You should check whether your scripts support the new option +syntax.
+Some added checks might trigger new warnings, so automated scripts +or alarms can have more output than with 1.x releases.
+All output (file and console) is now encoded according to a given +character set encoding which defaults to ISO-8859-15. If you +relied on the output being in a specific encoding, you might want to +use the output encoding option.
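What the default encoding means for the written bytes can be checked in Python. The sample text is made up, and replacing unencodable characters with "?" is just one possible error policy, not necessarily the one LinkChecker uses.

```python
text = "Prüfung: 5 €"

# ISO-8859-15 stores each of these characters in a single byte.
encoded = text.encode("iso-8859-15")

# Characters outside the character set need an explicit error policy.
fallback = "\u2192".encode("iso-8859-15", "replace")
```

Scripts that read the output as UTF-8 would fail on such bytes, which is why the encoding matters when you post-process results.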
+Since lots of filenames have changed you should check that any +manually installed versions prior to 1.13.0 are removed. Otherwise +you will have startup problems.
+The default output logger text now has colored output if the +output terminal supports it. The old colored output logger has +been removed.
+The -F option no longer suppresses normal output. The old behaviour +can be restored by giving the option -onone.
+The --status option is now the default and has been deprecated. The +old behaviour can be restored by giving the option --no-status.
+The default recursion depth is now infinite. The old behaviour +can be restored by giving the option --recursion-level=1.
+The option --strict has been renamed to --extern-strict-all.
+The commandline program linkchecker now returns a non-zero exit value +when errors were encountered. Previous versions always returned a zero +exit value. +To make scripts ignore the exit value and therefore restore the old behaviour, +you can append || true to the end of the command.
+