Use direct HTML documentation for the GUI client; moved the homepage content to a separate package.

Author: Bastian Kleineidam
Date: 2009-07-20 18:32:54 +02:00
parent 9faa7d33d2
commit fd29a15af7
19 changed files with 227 additions and 923 deletions

doc/index.html (new file)

@@ -0,0 +1,227 @@
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Check websites for broken links</title>
<link rel="stylesheet" href="_static/sphinxdoc.css" type="text/css" />
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="top" title="LinkChecker" href="" />
<style type="text/css">
img { border: 0; }
</style>
</head>
<body>
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<table border="0"><tr>
<td><a href=""><img
src="_static/logo64x64.png" border="0" alt="LinkChecker"/></a></td>
<td><h1>LinkChecker</h1></td>
</tr></table>
</div>
<div class="document">
<div class="documentwrapper">
<div class="body">
<div class="section" id="check-websites-for-broken-links">
<h1>Check websites for broken links</h1>
<p>LinkChecker is a free, <a class="reference external" href="http://www.gnu.org/licenses/gpl-2.0.html">GPL</a> licensed URL validator.</p>
<div class="section" id="basic-usage">
<h2>Basic usage</h2>
<p>To check a URL like <tt class="docutils literal"><span class="pre">http://www.myhomepage.org/</span></tt> it is enough to
execute <tt class="docutils literal"><span class="pre">linkchecker</span> <span class="pre">http://www.myhomepage.org/</span></tt>. This will check the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are also checked for validity.</p>
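<p>For scripted runs, the same invocation can be driven from Python with the standard library. The helper below is only a sketch; it assumes the <tt class="docutils literal"><span class="pre">linkchecker</span></tt> executable is on the PATH, and the function names are made up for illustration:</p>

```python
import subprocess

def linkchecker_command(url, recursion_level=None):
    # Build the argument list; a recursion limit maps to the
    # --recursion-level command line option described below.
    cmd = ["linkchecker"]
    if recursion_level is not None:
        cmd += ["--recursion-level", str(recursion_level)]
    cmd.append(url)
    return cmd

def run_linkchecker(url, **kwargs):
    # Return the exit code of the linkchecker process.
    return subprocess.call(linkchecker_command(url, **kwargs))
```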
</div>
<div class="section" id="performed-checks">
<h2>Performed checks</h2>
<p>All URLs have to pass a preliminary syntax test. Minor quoting
mistakes issue a warning; all other invalid syntax issues
are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.</p>
<ul>
<li><p class="first">HTTP links (<tt class="docutils literal"><span class="pre">http:</span></tt>, <tt class="docutils literal"><span class="pre">https:</span></tt>)</p>
<p>After connecting to the given HTTP server, the given path
or query is requested. All redirections are followed, and
if a user/password is given, it is used for authorization
when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.</p>
</li>
<li><p class="first">Local files (<tt class="docutils literal"><span class="pre">file:</span></tt>)</p>
<p>A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device files
or unreadable or non-existent files, are errors.</p>
<p>File contents are checked for recursion.</p>
</li>
<li><p class="first">Mail links (<tt class="docutils literal"><span class="pre">mailto:</span></tt>)</p>
<p>A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list fails.
For each mail address we check the following things:</p>
<ol class="arabic simple">
<li>Check the address syntax, both the part before and the part
after the &#64; sign.</li>
<li>Look up the MX DNS records. If we find no MX record,
we print an error.</li>
<li>Check if one of the mail hosts accepts an SMTP connection.
Hosts with higher priority are checked first.
If no host accepts SMTP, we print a warning.</li>
<li>Try to verify the address with the VRFY command. If we get
an answer, we print the verified address as an info.</li>
</ol>
</ol>
</li>
<li><p class="first">FTP links (<tt class="docutils literal"><span class="pre">ftp:</span></tt>)</p>
<p>For FTP links we do:</p>
<ol class="arabic simple">
<li>connect to the specified host</li>
<li>try to log in with the given user and password. The default
user is <tt class="docutils literal"><span class="pre">anonymous</span></tt>, the default password is <tt class="docutils literal"><span class="pre">anonymous&#64;</span></tt>.</li>
<li>try to change to the given directory</li>
<li>list the file with the NLST command</li>
</ol>
</li>
<li><p class="first">Telnet links (<tt class="docutils literal"><span class="pre">telnet:</span></tt>)</p>
<p>We try to connect and, if a user and password are given, log in to
the given telnet server.</p>
</li>
<li><p class="first">NNTP links (<tt class="docutils literal"><span class="pre">news:</span></tt>, <tt class="docutils literal"><span class="pre">snews:</span></tt>, <tt class="docutils literal"><span class="pre">nntp:</span></tt>)</p>
<p>We try to connect to the given NNTP server. If a news group or
article is specified, we try to request it from the server.</p>
</li>
<li><p class="first">Ignored links (<tt class="docutils literal"><span class="pre">javascript:</span></tt>, etc.)</p>
<p>An ignored link produces only a warning; no further checking
is done.</p>
<p>Here is a complete list of recognized, but ignored link types. The most
prominent of these are JavaScript links.</p>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">acap:</span></tt> (application configuration access protocol)</li>
<li><tt class="docutils literal"><span class="pre">afs:</span></tt> (Andrew File System global file names)</li>
<li><tt class="docutils literal"><span class="pre">chrome:</span></tt> (Mozilla specific)</li>
<li><tt class="docutils literal"><span class="pre">cid:</span></tt> (content identifier)</li>
<li><tt class="docutils literal"><span class="pre">clsid:</span></tt> (Microsoft specific)</li>
<li><tt class="docutils literal"><span class="pre">data:</span></tt> (data)</li>
<li><tt class="docutils literal"><span class="pre">dav:</span></tt> (dav)</li>
<li><tt class="docutils literal"><span class="pre">fax:</span></tt> (fax)</li>
<li><tt class="docutils literal"><span class="pre">find:</span></tt> (Mozilla specific)</li>
<li><tt class="docutils literal"><span class="pre">gopher:</span></tt> (Gopher)</li>
<li><tt class="docutils literal"><span class="pre">imap:</span></tt> (internet message access protocol)</li>
<li><tt class="docutils literal"><span class="pre">isbn:</span></tt> (ISBN (int. book numbers))</li>
<li><tt class="docutils literal"><span class="pre">javascript:</span></tt> (JavaScript)</li>
<li><tt class="docutils literal"><span class="pre">ldap:</span></tt> (Lightweight Directory Access Protocol)</li>
<li><tt class="docutils literal"><span class="pre">mailserver:</span></tt> (Access to data available from mail servers)</li>
<li><tt class="docutils literal"><span class="pre">mid:</span></tt> (message identifier)</li>
<li><tt class="docutils literal"><span class="pre">mms:</span></tt> (multimedia stream)</li>
<li><tt class="docutils literal"><span class="pre">modem:</span></tt> (modem)</li>
<li><tt class="docutils literal"><span class="pre">nfs:</span></tt> (network file system protocol)</li>
<li><tt class="docutils literal"><span class="pre">opaquelocktoken:</span></tt> (opaquelocktoken)</li>
<li><tt class="docutils literal"><span class="pre">pop:</span></tt> (Post Office Protocol v3)</li>
<li><tt class="docutils literal"><span class="pre">prospero:</span></tt> (Prospero Directory Service)</li>
<li><tt class="docutils literal"><span class="pre">rsync:</span></tt> (rsync protocol)</li>
<li><tt class="docutils literal"><span class="pre">rtsp:</span></tt> (real time streaming protocol)</li>
<li><tt class="docutils literal"><span class="pre">service:</span></tt> (service location)</li>
<li><tt class="docutils literal"><span class="pre">shttp:</span></tt> (secure HTTP)</li>
<li><tt class="docutils literal"><span class="pre">sip:</span></tt> (session initiation protocol)</li>
<li><tt class="docutils literal"><span class="pre">tel:</span></tt> (telephone)</li>
<li><tt class="docutils literal"><span class="pre">tip:</span></tt> (Transaction Internet Protocol)</li>
<li><tt class="docutils literal"><span class="pre">tn3270:</span></tt> (Interactive 3270 emulation sessions)</li>
<li><tt class="docutils literal"><span class="pre">vemmi:</span></tt> (versatile multimedia interface)</li>
<li><tt class="docutils literal"><span class="pre">wais:</span></tt> (Wide Area Information Servers)</li>
<li><tt class="docutils literal"><span class="pre">z39.50r:</span></tt> (Z39.50 Retrieval)</li>
<li><tt class="docutils literal"><span class="pre">z39.50s:</span></tt> (Z39.50 Session)</li>
</ul>
</li>
</ul>
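<p>The HTTP rules above can be condensed into a small classifier. This is an illustrative sketch only, not LinkChecker&#8217;s actual implementation:</p>

```python
def classify_http_result(final_status, redirects=()):
    # final_status: status code of the last response after all
    # redirections have been followed.
    # redirects: status codes of intermediate redirect responses.
    warnings = []
    if 301 in redirects:
        # permanently moved pages issue a warning
        warnings.append("permanently moved")
    # all final HTTP status codes other than 2xx are errors
    result = "valid" if 200 <= final_status < 300 else "error"
    return result, warnings
```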
</div>
<div class="section" id="recursion">
<h2>Recursion</h2>
<p>Before descending recursively into a URL, it has to fulfill several
conditions. They are checked in this order:</p>
<ol class="arabic simple">
<li>A URL must be valid.</li>
<li>A URL must be parseable. This currently includes HTML files,
Opera bookmarks files, and directories. If a file type cannot
be determined (for example it does not have a common HTML file
extension, and the content does not look like HTML), it is assumed
to be non-parseable.</li>
<li>The URL content must be retrievable. This is usually the case,
with exceptions such as mailto: links or unknown URL types.</li>
<li>The maximum recursion level must not be exceeded. It is configured
with the <tt class="docutils literal"><span class="pre">--recursion-level</span></tt> option and is unlimited by default.</li>
<li>It must not match the ignored URL list. This is controlled with
the <tt class="docutils literal"><span class="pre">--ignore-url</span></tt> option.</li>
<li>The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by searching for a
&#8220;nofollow&#8221; directive in the HTML header data.</li>
</ol>
<p>Note that the directory recursion reads all files in that
directory, not just a subset like <tt class="docutils literal"><span class="pre">index.htm*</span></tt>.</p>
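<p>The six conditions can be read as one predicate. The sketch below is a paraphrase of the rules, with a hypothetical <tt class="docutils literal"><span class="pre">url_info</span></tt> dict standing in for LinkChecker&#8217;s internal URL data:</p>

```python
import re

def should_recurse(url_info, max_level=None, ignore_patterns=()):
    # 1. the URL must be valid
    if not url_info["valid"]:
        return False
    # 2. the content must be parseable (e.g. HTML or a directory)
    if not url_info["parseable"]:
        return False
    # 3. the content must be retrievable
    if not url_info["retrievable"]:
        return False
    # 4. the --recursion-level limit must not be exceeded
    #    (unlimited by default, i.e. max_level is None)
    if max_level is not None and url_info["level"] >= max_level:
        return False
    # 5. the URL must not match an --ignore-url pattern
    if any(re.search(p, url_info["url"]) for p in ignore_patterns):
        return False
    # 6. robots "nofollow" directives must allow following links
    if "nofollow" in url_info.get("robots_directives", ()):
        return False
    return True
```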
</div>
<div class="section" id="frequently-asked-questions">
<h2>Frequently asked questions</h2>
<p><strong>Q: LinkChecker produced an error, but my web page is ok with
Mozilla/IE/Opera/...
Is this a bug in LinkChecker?</strong></p>
<p>A: Please check your web pages first. Are they really ok?
Use the <tt class="docutils literal"><span class="pre">--check-html</span></tt> option, or check if you are using a proxy
which produces the error.</p>
<p><strong>Q: I still get an error, but the page is definitely ok.</strong></p>
<p>A: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy of the webmaster running the website you are checking. Look at
the <tt class="docutils literal"><span class="pre">/robots.txt</span></tt> file, which follows the <a class="reference external" href="http://www.robotstxt.org/wc/norobots-rfc.html">robots.txt exclusion standard</a>.</p>
<p><strong>Q: How can I tell LinkChecker which proxy to use?</strong></p>
<p>A: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy or ftp_proxy environment
variables to a URL that identifies the proxy server before starting
LinkChecker. For example:</p>
<div class="highlight-python"><pre>$ http_proxy="http://www.someproxy.com:3128"
$ export http_proxy</pre>
</div>
<p><strong>Q: The link &#8220;mailto:john&#64;company.com?subject=Hello John&#8221; is reported
as an error.</strong></p>
<p>A: You have to quote special characters (e.g. spaces) in the subject field.
The correct link should be &#8220;mailto:...?subject=Hello%20John&#8221;.
Unfortunately, browsers like IE and Netscape do not enforce this.</p>
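<p>With Python 3&#8217;s standard library the quoting looks like this (a minimal illustration; the helper name is made up):</p>

```python
from urllib.parse import quote

def mailto_with_subject(address, subject):
    # Percent-encode the subject so that spaces and other special
    # characters are legal inside the URL.
    return "mailto:%s?subject=%s" % (address, quote(subject))
```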
<p><strong>Q: Does LinkChecker have JavaScript support?</strong></p>
<p>A: No, and it never will. If your page does not work without JavaScript,
it is better checked with a browser testing tool like <a class="reference external" href="http://seleniumhq.org/">Selenium</a>.</p>
<p><strong>Q: Is LinkChecker&#8217;s cookie feature insecure?</strong></p>
<p>A: If a cookie file is specified, the information will be sent
to the specified hosts.
The following restrictions apply for LinkChecker cookies:</p>
<ul class="simple">
<li>Cookies will only be sent to the originating server.</li>
<li>Cookies are only stored in memory. After LinkChecker finishes, they
are lost.</li>
<li>The cookie feature is disabled by default.</li>
</ul>
<p><strong>Q: I see LinkChecker gets a /robots.txt file for every site it
checks. What is that about?</strong></p>
<p>A: LinkChecker follows the <a class="reference external" href="http://www.robotstxt.org/wc/norobots-rfc.html">robots.txt exclusion standard</a>. To avoid
misuse of LinkChecker, you cannot turn this feature off.
See the <a class="reference external" href="http://www.robotstxt.org/wc/robots.html">Web Robot pages</a> and the <a class="reference external" href="http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt">Spidering report</a> for more info.</p>
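<p>The same standard can be exercised with Python 3&#8217;s built-in <tt class="docutils literal"><span class="pre">urllib.robotparser</span></tt> module (shown here as an illustration; this is not the code LinkChecker uses internally):</p>

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse an example robots.txt; LinkChecker fetches the real one
# from the /robots.txt path of every site it checks.
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

# rp.can_fetch(agent, url) now answers whether a robot named
# "LinkChecker" may fetch a given URL.
```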
<p><strong>Q: How do I print unreachable/dead documents of my website with
LinkChecker?</strong></p>
<p>A: This is not possible. It would require file system access to your web
repository and access to your web server configuration.</p>
<p><strong>Q: How do I check HTML/XML/CSS syntax with LinkChecker?</strong></p>
<p>A: Use the <tt class="docutils literal"><span class="pre">--check-html</span></tt> and <tt class="docutils literal"><span class="pre">--check-css</span></tt> options.</p>
</div>
</div>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer">
&copy; Copyright 2009, Bastian Kleineidam.
</div>
</body>
</html>

@@ -1,291 +0,0 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright (C) 2007-2009 Bastian Kleineidam
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
"""
A script to losslessly compress media files to be used in production
deployments of web software. Used together with HTML compression
it decreases the size of transmitted data considerably.

Currently supported media files:

Type        Extension  Compressor(s)
========================================================
JavaScript  .js        YUI compressor (a Java program)
CSS         .css       YUI compressor (a Java program)
PNG         .png       pngcrush (a C program)
JPEG        .jpg       jpegtran (a C program)
GIF         .gif       giftrans (a C program)

It compresses all supported media files to new files. The original
files are not changed unless explicitly requested.
Compressed files are named <filebase>-min.<ext> where <filebase> is
everything up to the last dot and <ext> is everything after the last dot.
If requested, the original file is overwritten with the compressed one.

A directory will be recursively searched and all media files within
will be compressed.

Files are only compressed when the compressed file is missing or the
original file is newer than the compressed file.
"""
import sys
import os
import getopt
import stat
import shutil
from distutils.spawn import spawn, find_executable
from distutils.errors import DistutilsExecError
import distutils.log
distutils.log.set_verbosity(1)

# list of extensions for compressable files
COMPRESS_EXTENSIONS = (".js", ".css", ".png", ".jpg", ".gif")


def log (*args):
    "Print the given arguments to sys.stderr."
    for arg in args:
        print >> sys.stderr, arg,
    print >> sys.stderr


def usage (msg=None):
    """
    Print usage information to sys.stderr and call sys.exit().
    The exit code is zero if msg is None, else one.
    """
    if msg is None:
        err = 0
    else:
        print >> sys.stderr, msg
        err = 1
    thisfile = os.path.basename(__file__)
    log("Usage:", thisfile, "[options]", "<file-or-directory>...")
    log("Options:")
    log(" --js-compressor - Specify the JavaScript compressor " \
        "(default: yuicompressor.jar)")
    log(" --exclude - Specify (part of) filenames to ignore")
    log(" --overwrite - Comma-separated list of file extensions to overwrite")
    log(" --help - Display help")
    sys.exit(err)


class DirectoryWalker:
    "Iterate over all files below a given directory."

    def __init__ (self, directory):
        self.stack = [directory]
        self.files = []
        self.index = 0

    def __getitem__ (self, index):
        while 1:
            try:
                file = self.files[self.index]
                self.index = self.index + 1
            except IndexError:
                # pop next directory from stack; an empty stack raises
                # IndexError and thus ends the iteration
                self.directory = self.stack.pop()
                self.files = os.listdir(self.directory)
                self.index = 0
            else:
                # got a filename
                fullname = os.path.join(self.directory, file)
                if os.path.isdir(fullname) and not os.path.islink(fullname):
                    self.stack.append(fullname)
                return fullname


def is_compressable (settings, filename):
    "Check if given filename is compressable."
    # is it excluded?
    if [x for x in settings["exclude"] if x in filename]:
        return False
    # is it compressable?
    return os.path.splitext(filename)[1] in COMPRESS_EXTENSIONS


def get_files (settings, args):
    """
    Given a list of files and/or directories return all compressable
    files as an iterator.
    """
    for arg in args:
        if os.path.isdir(arg):
            for file in DirectoryWalker(arg):
                if is_compressable(settings, file):
                    yield file
        elif os.path.isfile(arg):
            if is_compressable(settings, arg):
                yield arg
        else:
            log("Warning: not a file or directory", repr(arg))


settings = {
    # default compress executables
    "compressor": {
        ".js": "yuicompressor.jar", # note: .jar files are run via "java -jar"
        ".css": "yuicompressor.jar",
        ".png": "pngcrush",
        ".jpg": "jpegtran",
        ".gif": "giftrans",
    },
    # set of filenames (or parts of them) to exclude
    "exclude": set(),
    # set of file extensions to overwrite
    "overwrite": set(),
}


def parse_options (args):
    """
    Parse command line arguments.
    @return: (settings, args)
    @rtype: tuple (dict, list)
    """
    long_opts = ["help", "js-compressor=", "exclude=", "overwrite="]
    try:
        opts, args = getopt.getopt(args, "", long_opts)
    except getopt.error:
        usage(msg=sys.exc_info()[1])
    for opt, arg in opts:
        if opt == "--help":
            usage()
        elif opt == "--js-compressor":
            for ext in (".js", ".css"):
                settings["compressor"][ext] = arg
        elif opt == "--exclude":
            settings["exclude"].add(arg)
        elif opt == "--overwrite":
            exts = [x.strip().lower() for x in arg.split(",") if x]
            settings["overwrite"].update(exts)
        else:
            usage(msg="Unknown option %r" % opt)
    return settings, args


def get_mtime (filename):
    "Return modification time of file."
    return os.stat(filename)[stat.ST_MTIME]


def get_fsize (filename):
    "Return file size in bytes."
    return os.stat(filename)[stat.ST_SIZE]


def needs_compression (infile, outfile):
    "Check if infile needs to be compressed to given outfile."
    if not os.path.exists(outfile):
        return True
    return get_mtime(infile) > get_mtime(outfile)


def compress_file (infile):
    "Compress given file if needed."
    base, ext = os.path.splitext(infile)
    if base.endswith("-min"):
        # already a compressed file
        return
    outfile = "%s-min%s" % (base, ext)
    if needs_compression(infile, outfile):
        cmd = compress_cmd(ext, infile, outfile)
        if not cmd:
            log("Skipping", repr(infile), "no compressor available")
            return
        try:
            log("Compressing", repr(infile), "...")
            run_cmd(cmd)
        except DistutilsExecError, msg:
            log("Error running %s: %s" % (cmd, msg))
        else:
            insize = get_fsize(infile)
            outsize = get_fsize(outfile)
            if outsize > insize:
                log("Warning: compressed file is bigger than original "
                    "(%dB > %dB); copying instead." % (outsize, insize))
                shutil.copyfile(infile, outfile)
            else:
                percentage = float(outsize * 100) / insize
                log(".. compressed to %.2f%% (%dB -> %dB)" % \
                    (percentage, insize, outsize))
            if ext[1:].lower() in settings["overwrite"]:
                shutil.move(outfile, infile)
    else:
        log("Skipping", repr(infile))


def compress_cmd (ext, infile, outfile):
    "Get list of command args for compression."
    cmd = []
    compressor = settings["compressor"][ext]
    if compressor.endswith(".jar"):
        if not find_executable("java"):
            return None
        cmd.insert(0, "java")
        cmd.insert(1, "-jar")
    elif not find_executable(compressor):
        return None
    cmd.append(compressor)
    cmd.extend(compressor_args(compressor, infile, outfile))
    return cmd


def compressor_args (compressor, infile, outfile):
    """
    Return list of command line arguments that compress infile to outfile
    with given compressor.
    """
    basename = os.path.basename(compressor).lower()
    if basename.startswith("yuicompressor"):
        args = compressor_args_yui(infile, outfile)
    elif basename.startswith("pngcrush"):
        args = compressor_args_pngcrush(infile, outfile)
    elif basename.startswith("jpegtran"):
        args = compressor_args_jpegtran(infile, outfile)
    elif basename.startswith("giftrans"):
        args = compressor_args_giftrans(infile, outfile)
    else:
        raise getopt.error("Unknown compressor %r" % compressor)
    return args


def compressor_args_yui (infile, outfile):
    "Arguments for the YUI compressor."
    return ["--charset", "utf8", "-o", outfile, infile]


def compressor_args_pngcrush (infile, outfile):
    "Arguments for pngcrush."
    return [infile, outfile]


def compressor_args_jpegtran (infile, outfile):
    "Arguments for jpegtran."
    return ["-optimize", "-perfect", "-copy", "none",
            "-outfile", outfile, infile]


def compressor_args_giftrans (infile, outfile):
    "Arguments for giftrans."
    return ["-C", "-o", outfile, infile]


def run_cmd (cmd):
    "Execute given command."
    return spawn(cmd)


def main (args):
    settings, args = parse_options(args)
    for file in get_files(settings, args):
        compress_file(file)


if __name__ == '__main__':
    main(sys.argv[1:])

@@ -1,193 +0,0 @@
# -*- coding: utf-8 -*-
#
# LinkChecker documentation build configuration file, created by
# sphinx-quickstart on Tue Jan 20 23:59:41 2009.
#
# This file is execfile()d with the current directory set to its containing dir.
#
# The contents of this file are pickled, so don't put values in the namespace
# that aren't pickleable (module imports are okay, they're removed automatically).
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.
#import sys, os
# If your extensions are in another directory, add it here. If the directory
# is relative to the documentation root, use os.path.abspath to make it
# absolute, like shown here.
#sys.path.append(os.path.abspath('.'))
# General configuration
# ---------------------
# Add any Sphinx extension module names here, as strings. They can be extensions
# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
extensions = []
# Add any paths that contain templates here, relative to this directory.
templates_path = ['templates']
# The suffix of source filenames.
source_suffix = '.txt'
# The encoding of source files.
#source_encoding = 'utf-8'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = u'LinkChecker'
copyright = u'2009, Bastian Kleineidam'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '5.0.2'
# The full version, including alpha/beta/rc tags.
release = version
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'
# List of documents that shouldn't be included in the build.
#unused_docs = None
# List of directories, relative to source directory, that shouldn't be searched
# for source files.
exclude_trees = []
# The reST default role (used for this markup: `text`) to use for all documents.
#default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
add_module_names = False
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'friendly'
# Options for HTML output
# -----------------------
# The style sheet to use for HTML and HTML Help pages. A file of that name
# must exist either in Sphinx' static/ path, or in one of the custom paths
# given in html_static_path.
html_style = 'sphinxdoc.css'
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
html_title = project
# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
html_logo = None
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
html_favicon = "favicon.ico"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['static']
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}
# If false, no module index is generated.
html_use_modindex = False
# If false, no index is generated.
html_use_index = False
# If true, the index is split into individual pages for each letter.
#html_split_index = False
# If true, the reST sources are included in the HTML build as _sources/<name>.
html_copy_source = False
# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''
# If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = ''
# Output file base name for HTML help builder.
htmlhelp_basename = 'LinkCheckerdoc'
# Options for LaTeX output
# ------------------------
# The paper size ('letter' or 'a4').
latex_paper_size = 'a4'
# The font size ('10pt', '11pt' or '12pt').
#latex_font_size = '10pt'
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title, author, document class [howto/manual]).
latex_documents = [
    ('index', 'LinkChecker.tex', ur'LinkChecker Documentation',
     ur'Bastian Kleineidam', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False
# Additional stuff for the LaTeX preamble.
#latex_preamble = ''
# Documents to append as an appendix to all manuals.
#latex_appendices = []
# If false, no module index is generated.
#latex_use_modindex = True
#def setup(app):
# app.add_config_value('foo', 'default', True)

@@ -1,270 +0,0 @@
Documentation
=============
Basic usage
-----------
To check a URL like ``http://www.myhomepage.org/`` it is enough to
execute ``linkchecker http://www.myhomepage.org/``. This will check the
complete domain of www.myhomepage.org recursively. All links pointing
outside of the domain are also checked for validity.
Performed checks
----------------
All URLs have to pass a preliminary syntax test. Minor quoting
mistakes will issue a warning, all other invalid syntax issues
are errors.
After the syntax check passes, the URL is queued for connection
checking. All connection check types are described below.
- HTTP links (``http:``, ``https:``)
After connecting to the given HTTP server the given path
or query is requested. All redirections are followed, and
if user/password is given it will be used as authorization
when necessary.
Permanently moved pages issue a warning.
All final HTTP status codes other than 2xx are errors.
- Local files (``file:``)
A regular, readable file that can be opened is valid. A readable
directory is also valid. All other files, for example device files,
unreadable or non-existing files are errors.
File contents are checked for recursion.
- Mail links (``mailto:``)
A mailto: link eventually resolves to a list of email addresses.
If one address fails, the whole list will fail.
For each mail address we check the following things:
1) Check the adress syntax, both of the part before and after
the @ sign.
2) Look up the MX DNS records. If we found no MX record,
print an error.
3) Check if one of the mail hosts accept an SMTP connection.
Check hosts with higher priority first.
If no host accepts SMTP, we print a warning.
4) Try to verify the address with the VRFY command. If we got
an answer, print the verified address as an info.
- FTP links (``ftp:``)
For FTP links we do:
1) connect to the specified host
2) try to login with the given user and password. The default
user is ``anonymous``, the default password is ``anonymous@``.
3) try to change to the given directory
4) list the file with the NLST command
- Telnet links (``telnet:``)
We try to connect and if user/password are given, login to the
given telnet server.
- NNTP links (``news:``, ``snews:``, ``nntp``)
We try to connect to the given NNTP server. If a news group or
article is specified, try to request it from the server.
- Ignored links (``javascript:``, etc.)
An ignored link will only print a warning. No further checking
will be made.
Here is a complete list of recognized, but ignored links. The most
prominent of them should be JavaScript links.
- ``acap:`` (application configuration access protocol)
- ``afs:`` (Andrew File System global file names)
- ``chrome:`` (Mozilla specific)
- ``cid:`` (content identifier)
- ``clsid:`` (Microsoft specific)
- ``data:`` (data)
- ``dav:`` (dav)
- ``fax:`` (fax)
- ``find:`` (Mozilla specific)
- ``gopher:`` (Gopher)
- ``imap:`` (internet message access protocol)
- ``isbn:`` (ISBN (int. book numbers))
- ``javascript:`` (JavaScript)
- ``ldap:`` (Lightweight Directory Access Protocol)
- ``mailserver:`` (Access to data available from mail servers)
- ``mid:`` (message identifier)
- ``mms:`` (multimedia stream)
- ``modem:`` (modem)
- ``nfs:`` (network file system protocol)
- ``opaquelocktoken:`` (opaquelocktoken)
- ``pop:`` (Post Office Protocol v3)
- ``prospero:`` (Prospero Directory Service)
- ``rsync:`` (rsync protocol)
- ``rtsp:`` (real time streaming protocol)
- ``service:`` (service location)
- ``shttp:`` (secure HTTP)
- ``sip:`` (session initiation protocol)
- ``tel:`` (telephone)
- ``tip:`` (Transaction Internet Protocol)
- ``tn3270:`` (Interactive 3270 emulation sessions)
- ``vemmi:`` (versatile multimedia interface)
- ``wais:`` (Wide Area Information Servers)
- ``z39.50r:`` (Z39.50 Retrieval)
- ``z39.50s:`` (Z39.50 Session)
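The address-syntax step (1) of the mail link checks above can be sketched in Python; steps 2–4 need DNS lookups and SMTP network access and are left out here. The regular expressions below are simplified assumptions for illustration, not LinkChecker's actual parser:

```python
import re

def check_mail_syntax(address):
    """Step 1: validate the parts before and after the @ sign.

    A simplified sketch; the real check is stricter.
    """
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        return False
    # Local part: printable characters without spaces (simplified).
    if not re.match(r"^[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+$", local):
        return False
    # Domain: dot-separated labels, ending in a letters-only top level.
    return re.match(r"^([A-Za-z0-9-]+\.)+[A-Za-z]{2,}$", domain) is not None
```

Both halves must pass before the MX and SMTP steps are attempted.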
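The four FTP steps map closely onto Python's standard ``ftplib``. This sketch (with the assumed helper ``split_ftp_path``) is an illustration of the sequence, not LinkChecker's implementation:

```python
from ftplib import FTP
from urllib.parse import urlparse

def split_ftp_path(path):
    """Split a URL path into (directory, filename) for steps 3 and 4."""
    directory, _, filename = path.lstrip("/").rpartition("/")
    return directory, filename

def check_ftp_url(url, user="anonymous", passwd="anonymous@"):
    """Run the four FTP checking steps for a URL like ftp://host/dir/file."""
    parts = urlparse(url)
    directory, filename = split_ftp_path(parts.path)
    ftp = FTP(parts.hostname, timeout=30)  # 1) connect to the host
    try:
        ftp.login(user, passwd)            # 2) login, anonymous by default
        if directory:
            ftp.cwd(directory)             # 3) change to the directory
        return ftp.nlst(filename)          # 4) list the file (NLST)
    finally:
        ftp.quit()
```

Any step that fails raises an ``ftplib`` error, which corresponds to the link being reported as broken.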
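Deciding whether a link is ignored boils down to a scheme lookup. A minimal sketch (the set below is only a subset of the list above, and the function name is illustrative):

```python
# Schemes that are recognized but only produce a warning
# (a subset of the full list above).
IGNORED_SCHEMES = {
    "acap", "chrome", "cid", "clsid", "data", "javascript",
    "ldap", "mailserver", "mms", "nfs", "tel", "wais",
}

def is_ignored(url):
    """Return True if the URL's scheme is recognized but not checked."""
    scheme, sep, _ = url.partition(":")
    return bool(sep) and scheme.lower() in IGNORED_SCHEMES
```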
Recursion
---------
Before a URL is descended into recursively, it has to fulfill several
conditions. They are checked in this order:
1. A URL must be valid.
2. A URL must be parseable. This currently includes HTML files,
Opera bookmark files, and directories. If a file type cannot
be determined (for example, it does not have a common HTML file
extension and the content does not look like HTML), it is assumed
to be non-parseable.
3. The URL content must be retrievable. This is usually the case,
except for mailto: links or unknown URL types.
4. The maximum recursion level must not be exceeded. It is configured
with the ``--recursion-level`` option and is unlimited by default.
5. It must not match the ignored URL list. This is controlled with
the ``--ignore-url`` option.
6. The Robots Exclusion Protocol must allow links in the URL to be
followed recursively. This is checked by searching for a
"nofollow" directive in the HTML header data.
Note that the directory recursion reads all files in that
directory, not just a subset like ``index.htm*``.
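The six conditions above can be sketched as an ordered chain of checks that stops at the first failure. The ``url_data`` and ``config`` objects here are stand-ins for LinkChecker's internal state; all attribute names are assumptions for illustration, not the real API:

```python
from types import SimpleNamespace

def should_recurse(url_data, config):
    """Apply the six recursion conditions in order; stop at first failure."""
    if not url_data.valid:                      # 1. valid URL
        return False
    if not url_data.is_parseable:               # 2. parseable content type
        return False
    if not url_data.can_get_content:            # 3. retrievable content
        return False
    # 4. depth limit (a negative configured value means unlimited)
    if 0 <= config.recursion_level <= url_data.recursion_level:
        return False
    # 5. not matched by any --ignore-url pattern
    if any(pat.search(url_data.url) for pat in config.ignore_urls):
        return False
    # 6. robots exclusion: no "nofollow" directive seen
    return "nofollow" not in url_data.robots_directives
```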
Frequently asked questions
--------------------------
**Q: LinkChecker produced an error, but my web page is ok with
Mozilla/IE/Opera/...
Is this a bug in LinkChecker?**
A: Please check your web pages first. Are they really ok?
Use the ``--check-html`` option, or check whether you are using a proxy
that produces the error.
**Q: I still get an error, but the page is definitely ok.**
A: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy of the webmaster running the website you are checking. Look at
the ``/robots.txt`` file, which follows the `robots.txt exclusion standard`_.
.. _`robots.txt exclusion standard`:
http://www.robotstxt.org/wc/norobots-rfc.html
**Q: How can I tell LinkChecker which proxy to use?**
A: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy, ftp_proxy environment
variables to a URL that identifies the proxy server before starting
LinkChecker. For example
::
$ http_proxy="http://www.someproxy.com:3128"
$ export http_proxy
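A quick way to verify which proxy settings Python-based tools such as LinkChecker will see is Python's own ``urllib``, which reads the same environment variables:

```python
import os
import urllib.request

# Simulate the shell export above, then ask urllib what it picked up.
os.environ["http_proxy"] = "http://www.someproxy.com:3128"
proxies = urllib.request.getproxies()
print(proxies["http"])  # http://www.someproxy.com:3128
```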
**Q: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.**
A: You have to quote special characters (e.g. spaces) in the subject field.
The correct link should be "mailto:...?subject=Hello%20John".
Unfortunately, browsers like IE and Netscape do not enforce this.
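When generating such links yourself, Python's ``urllib.parse.quote`` produces the correct quoting (shown here as an illustration, using the address from the question above):

```python
from urllib.parse import quote

# quote() percent-encodes characters that are unsafe in a URL,
# such as the space in the subject text.
subject = "Hello John"
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)  # mailto:john@company.com?subject=Hello%20John
```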
**Q: Does LinkChecker have JavaScript support?**
A: No, and it never will. If your page does not work without JavaScript,
it is better checked with a browser testing tool like Selenium_.
.. _Selenium:
http://seleniumhq.org/
**Q: Is LinkChecker's cookie feature insecure?**
A: Cookies cannot store more information than is in the HTTP request itself,
so you are not giving away any additional system information.
After being stored, however, cookies are sent back to the server on
subsequent requests. Not to every server, but only to the one the cookie
originated from! This can be used to "track" requests to that server,
which is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost, so the tracking is restricted to the checking time.
The cookie feature is disabled by default.
**Q: I want to have my own logging class. How can I use it in LinkChecker?**
A: Currently, only a Python API lets you define new logging classes.
Define your own logging class as a subclass of StandardLogger or any other
logging class in the log module.
Then call ``logger_add`` on a ``Configuration`` instance to register
your new logger class, and append a new logger instance to the file
output.
::
# MyLogger is your own module providing the MyLogger class.
import linkcheck
import MyLogger

log_format = 'mylog'
log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
cfg = linkcheck.configuration.Configuration()
# Register the logger class under its format name ...
cfg.logger_add(log_format, MyLogger.MyLogger)
# ... and append an instance to the file output loggers.
cfg['fileoutput'].append(cfg.logger_new(log_format, log_args))
**Q: LinkChecker does not ignore anchor references on caching.**
**Q: Some links with anchors are getting checked twice.**
A: This is not a bug.
It is not necessarily true that if a URL ``ABC#anchor1`` works, then
``ABC#anchor2`` works too. This is not specified anywhere, and there are
server-side scripts that fail on some anchors but not on others.
This is the reason for always checking URLs with different anchors.
If you really want to disable this, use the ``--no-anchor-caching``
option.
**Q: I see LinkChecker gets a /robots.txt file for every site it
checks. What is that about?**
A: LinkChecker follows the `robots.txt exclusion standard`_. To avoid
misuse of LinkChecker, you cannot turn this feature off.
See the `Web Robot pages`_ and the `Spidering report`_ for more info.
.. _`robots.txt exclusion standard`:
http://www.robotstxt.org/wc/norobots-rfc.html
.. _`Web Robot pages`:
http://www.robotstxt.org/wc/robots.html
.. _`Spidering report`:
http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
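You can check what the exclusion standard permits for a given robots.txt yourself with Python's ``urllib.robotparser``. This mirrors, but is not, LinkChecker's internal check:

```python
from urllib.robotparser import RobotFileParser

# Feed a robots.txt body directly instead of fetching it over HTTP.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("LinkChecker", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("LinkChecker", "http://example.com/public.html"))        # True
```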
**Q: How do I print unreachable/dead documents of my website with
LinkChecker?**
A: You cannot. This would require file system access to your web
repository and access to your web server configuration.
**Q: How do I check HTML/XML/CSS syntax with LinkChecker?**
A: Use the ``--check-html`` and ``--check-css`` options.
@ -1,49 +0,0 @@
.. meta::
:keywords: link, URL, validation, checking
===============================
Check websites for broken links
===============================
LinkChecker is a free, GPL_ licensed URL validator.
.. _GPL:
http://www.gnu.org/licenses/gpl-2.0.html
If you like LinkChecker, consider a donation_ to improve it even
more!
.. _donation:
http://sourceforge.net/project/project_donations.php?group_id=1913
Features
========
- recursive and multithreaded checking
- output in colored or normal text, HTML, SQL, CSV, XML or a sitemap
graph in different formats
- HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and local file
links support
- restriction of link checking with regular expression filters for URLs
- proxy support
- username/password authorization for HTTP, FTP and Telnet
- honors robots.txt exclusion protocol
- Cookie support
- HTML and CSS syntax check
- Antivirus check
- a command line interface
- a GUI client interface
- a (Fast)CGI web interface (requires HTTP server)
Screenshots
===========
+------------------------------------+------------------------------------+------------------------------------+
| .. image:: shot1_thumb.jpg | .. image:: shot2_thumb.jpg | .. image:: shot3_thumb.jpg |
| :align: center | :align: center | :align: center |
| :target: _static/shot1.png | :target: _static/shot2.png | :target: _static/shot3.png |
+------------------------------------+------------------------------------+------------------------------------+
| Commandline interface | GUI client | Web interface |
+------------------------------------+------------------------------------+------------------------------------+
@ -1,54 +0,0 @@
Other link checkers
===================
If LinkChecker does not fit your requirements, you can check out the
competition. All of these programs also have an `Open Source license`_,
like LinkChecker.
.. _`Open Source license`:
http://www.opensource.org/licenses/
- `Checklinks`_ written in Perl
.. _Checklinks:
http://www.jmarshall.com/tools/cl/
- `Dead link check`_ written in Perl
.. _Dead link check:
http://dlc.sourceforge.net/
- `gURLChecker`_ written in C
.. _gURLChecker:
http://labs.libre-entreprise.org/projects/gurlchecker/
- `KLinkStatus`_ written in C++
.. _KLinkStatus:
http://klinkstatus.kdewebdev.org/
- `link-checker`_ written in C
.. _link-checker:
http://ymettier.free.fr/link-checker/link-checker.html
- `linklint`_ written in Perl
.. _linklint:
http://www.linklint.org/
- `W3C Link Checker`_ HTML interface only
.. _W3C Link Checker:
http://validator.w3.org/checklink/
- `webcheck`_ written in Python
.. _webcheck:
http://ch.tudelft.nl/~arthur/webcheck/
- `webgrep`_ written in Perl
.. _webgrep:
http://cgi.linuxfocus.org/~guido/index.html#webgrep
@ -1,3 +0,0 @@
favicon.ico: favicon32x32.png favicon16x16.png
png2ico favicon.ico favicon32x32.png favicon16x16.png
@ -1,63 +0,0 @@
{% extends "!layout.html" %}
{% block extrahead %}
<style type="text/css">
img { border: 0; }
</style>
{% endblock %}
{% block rootrellink %}
<li><a href="{{ pathto('index') }}">Home </a> |&nbsp;</li>
<li><a href="{{ pathto('documentation') }}">Documentation </a>|&nbsp;</li>
<li><a href="{{ pathto('other') }}">Other link checkers </a> </li>
{% endblock %}
{% block relbar1 %}
<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
{% if builder == 'html' %}
<div style="float:right;"><a
href="http://sourceforge.net/projects/linkchecker"><img
src="http://sflogo.sourceforge.net/sflogo.php?group_id=1913&type=13"
width="120" height="30" border="0"
alt="Get LinkChecker at SourceForge.net." /></a>
{# Piwik tag #}
<script type="text/javascript">
var pkBaseURL = (("https:" == document.location.protocol) ? "https:" : "http:") + "//apps.sourceforge.net/piwik/linkchecker/";
document.write(unescape("%3Cscript src='" + pkBaseURL + "piwik.js' type='text/javascript'%3E%3C/script%3E"));
</script><script type="text/javascript">
piwik_action_name = '';
piwik_idsite = 1;
piwik_url = pkBaseURL + "piwik.php";
piwik_log(piwik_action_name, piwik_idsite, piwik_url);
</script>
<object><noscript><p><img src="http://apps.sourceforge.net/piwik/linkchecker/piwik.php?idsite=1" alt=""/></p></noscript></object>
{# End Piwik tag #}
</div>
{% endif %}
<table border="0"><tr>
<td><a href="{{ pathto('index') }}"><img
src="{{ pathto("_static/logo64x64.png", 1) }}" border="0" alt="LinkChecker"/></a></td>
<td><h1>LinkChecker</h1></td>
</tr></table>
</div>
{{ super() }}
{% endblock %}
{% block relbar2 %}{% endblock %}
{# put the sidebar before the body #}
{% block sidebarsearch %}{{ super() }}{% endblock %}
{% block sidebar1 %}{{ sidebar() }}{% endblock %}
{% block sidebar2 %}{% endblock %}
{% block sidebarlogo %}{% if builder == 'html' %}
{% if pagename == 'index' %}
<h3>Download</h3>
<a href="http://prdownloads.sourceforge.net/linkchecker/LinkChecker-{{version}}.exe?download">LinkChecker&nbsp;{{version}}&nbsp;for&nbsp;Windows</a><br/>
<a href="http://prdownloads.sourceforge.net/linkchecker/LinkChecker-{{version}}.tar.gz?download">LinkChecker&nbsp;{{version}}&nbsp;source</a><br/>
<a href="http://linkchecker.git.sourceforge.net/git/gitweb.cgi?p=linkchecker;a=blob;f=ChangeLog.txt;hb=HEAD">Changelog</a><br/>
<h3>Support</h3>
<a href="http://sourceforge.net/tracker/?func=add&group_id=1913&atid=101913">Bug&nbsp;tracker</a><br/>
<a href="http://sourceforge.net/scm/?type=git&group_id=1913">Development&nbsp;repository</a><br/>
{% endif %}
{% endif %}
{% endblock %}
{% block sidebartoc %}{% endblock %}