linkchecker/FAQ
2002-07-14 10:16:09 +00:00

Q: LinkChecker produced an error, but my web page is ok with
Netscape/IE/Opera/...
Is this a bug in LinkChecker?
A: Please check your web pages first. Are they really ok? Use
a syntax highlighting editor, or run them through HTML Tidy from
www.w3c.org. Also check that your web server accepts HEAD requests.
Q: I still get an error, but the page is definitely ok.
Q: I get an error "too much redirections", but the URL displays ok in my
browser.
A: The difference between LinkChecker and web browsers is the type of HTTP
request: LinkChecker sends a HEAD request, while browsers send a GET
request for the same URL. A lot of servers have broken HEAD support.
LinkChecker tries to detect some of them, but cannot catch all.
Nevertheless, this is a server error. If the URL is on your own web
server, go check that your server handles HEAD requests correctly!
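To see for yourself whether a server handles HEAD, you can send one by
hand. This is a minimal Python sketch (host, port and path are
placeholders), not part of LinkChecker itself:

```python
import http.client

def head_status(host, port=80, path="/"):
    # Send a HEAD request, as LinkChecker does, and return the status code.
    conn = http.client.HTTPConnection(host, port, timeout=10)
    conn.request("HEAD", path)
    status = conn.getresponse().status
    conn.close()
    return status
```

If this returns an error status while a GET of the same URL works in your
browser, the server's HEAD support is broken.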
Q: How can I tell LinkChecker which proxy to use?
A: LinkChecker works transparently with proxies. In a Unix or Windows
environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy
environment variables to a URL that identifies the proxy server before
starting LinkChecker. For example:
  $ http_proxy="http://www.someproxy.com:3128"
  $ export http_proxy
In a Macintosh environment, LinkChecker will retrieve proxy information
from Internet Config.
Q: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.
A: You have to quote special characters (e.g. spaces) in the subject field.
The correct link is "mailto:...?subject=Hello%20John".
Unfortunately, browsers like IE and Netscape do not enforce this.
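The quoting can be done programmatically. A small Python sketch using the
standard library (the address and subject are just example values):

```python
from urllib.parse import quote

# Percent-encode the subject so spaces and other special characters
# become legal inside the mailto URL.
subject = "Hello John"
link = "mailto:john@company.com?subject=" + quote(subject)
# link is now "mailto:john@company.com?subject=Hello%20John"
```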
Q: Does LinkChecker support JavaScript?
A: No, and it never will. JavaScript sucks. If your page does not
work without JS, then your web design is broken.
Learn PHP or Zope or ASP, and use JavaScript only as an addon for your
web pages.
Q: I have a pretty large site to check. How can I restrict checking
to my own pages?
A: Look at the options --intern, --extern, --strict, --denyallow and
--recursion-level.
Q: I don't get this --extern/--intern stuff.
A: When it comes to checking there are three types of URLs:
1) strict URLs:
we do only syntax checking
2) extern URLs:
like 1), but we additionally check if they are valid by connect()ing
to them
3) intern URLs:
like 2), but we additionally check if they are HTML pages and, if so,
we descend recursively into the link and check all the links in the
HTML content.
The --recursion-level option restricts the number of such recursive
descents.
LinkChecker provides four options which determine which of those three
categories a URL falls into: --intern, --extern, --strict and
--denyallow.
By default all URLs are intern. With --extern you specify which URLs
are extern; with --intern you specify which URLs are intern.
Now imagine you have both --extern and --intern. What happens
when a URL matches both patterns? Or when it matches none? In this
situation the --denyallow option specifies the order in which we match
the URL. By default the order is intern/extern; with --denyallow it is
extern/intern. Either way, the first match counts, and if none matches,
the last checked category is the category for the URL.
Finally, with --strict all extern URLs are strict.
Oh, and just to boggle your mind: you can have more than one extern
regular expression in a config file, and for each of those expressions
you can specify whether the matched extern URLs should be strict or not.
An example. Assume we want to check only URLs of our domains
'mydomain.com' and 'myotherdomain.com'. Then we specify
-i'^http://my(other)?domain\.com' as intern regular expression; all other
URLs are treated as extern. Easy.
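The matching order can be sketched in Python. This is a hypothetical
illustration of the rule described above, not LinkChecker's actual code:

```python
import re

def categorize(url, intern_patterns, extern_patterns, denyallow=False):
    """Return 'intern' or 'extern' for a URL.

    Default order: try the intern patterns first, then the extern ones
    (the first match counts); with denyallow the order is reversed.
    If nothing matches, the last checked category wins.
    """
    order = [("intern", intern_patterns), ("extern", extern_patterns)]
    if denyallow:
        order.reverse()
    category = None
    for name, patterns in order:
        category = name  # remember the last checked category
        if any(re.search(p, url) for p in patterns):
            return name  # first match counts
    return category

# The example from above: only our two domains are intern.
intern = [r'^http://my(other)?domain\.com']
```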
Q: Are Cookies insecure?
A: Cookies cannot store more information than is in the HTTP request itself,
so you are not giving away any additional system information.
Once stored, however, the cookies are sent back on subsequent requests.
Not to every server, but only to the one the cookie originated from!
This can be used to "track" subsequent requests to this server,
and this is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost, so the tracking is restricted to the checking time.
Q: I want to have my own logging class. How can I use it in LinkChecker?
A: Currently, only the Python API lets you define new logging classes.
Define your own logging class as a subclass of StandardLogger or any other
logging class in the log module.
Then call the addLogger function of Config.Configuration to register
your new logger.
After this, append a new logger instance to the file output:
import linkcheck, MyLogger
log_format = 'mylog'
log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
cfg = linkcheck.Config.Configuration()
# register the new logger class under the name 'mylog'
cfg.addLogger(log_format, MyLogger.MyLogger)
# create an instance and add it to the list of file output loggers
cfg['fileoutput'].append(cfg.newLogger(log_format, log_args))