Q: LinkChecker produced an error, but my web page is ok with
   Netscape/IE/Opera/... Is this a bug in LinkChecker?
A: Please check your web pages first. Are they really ok? Use
   a syntax-highlighting editor! Use HTML Tidy from www.w3c.org!
   Also check whether the web server accepts HEAD requests.

Q: I still get an error, but the page is definitely ok.
Q: I get an error "too much redirections", but the URL displays ok in my
   browser.
A: The difference between LinkChecker and web browsers is the type of HTTP
   request: LinkChecker makes a HEAD request, while browsers make a GET
   request for the same URL. A lot of servers are broken when it comes to
   HEAD support. LinkChecker tries to detect some of them, but cannot
   catch all. Nevertheless, this is an error. If the URL is on your own
   web server, check that your server handles HEAD requests!
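
   You can test HEAD support yourself. A minimal sketch using Python 3's
   standard http.client module (www.example.com is just a placeholder for
   the server you are debugging):

     import http.client

     # send a bare HEAD request and print the status line
     conn = http.client.HTTPConnection("www.example.com")
     conn.request("HEAD", "/")
     response = conn.getresponse()
     print(response.status, response.reason)  # e.g. 200 OK
     conn.close()

   If this prints an error status while a browser shows the page fine,
   the server's HEAD support is broken.
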
Q: How can I tell LinkChecker which proxy to use?
A: LinkChecker works transparently with proxies. In a Unix or Windows
   environment, set the http_proxy, https_proxy, ftp_proxy or gopher_proxy
   environment variables to a URL that identifies the proxy server before
   starting LinkChecker. For example (Bourne shell):

     http_proxy="http://www.someproxy.com:3128"
     export http_proxy

   In a Macintosh environment, LinkChecker will retrieve proxy information
   from Internet Config.
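
   LinkChecker is written in Python, and Python's standard library reads
   these variables. As a quick sanity check, this sketch (Python 3)
   prints the proxies your environment defines:

     import urllib.request

     # returns a dict like {'http': 'http://www.someproxy.com:3128'}
     print(urllib.request.getproxies())
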
Q: The link "mailto:john@company.com?subject=Hello John" is reported
|
|
as an error.
|
|
A: You have to quote special characters (e.g. spaces) in the subject field.
|
|
The correct link should be "mailto:...?subject=Hello%20John"
|
|
Unfortunately browsers like IE and Netscape do not enforce this.
|
|
|
|
|
|
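
   Python's standard library can do the quoting for you. A small sketch,
   using the address from the question above:

     import urllib.parse

     # percent-encode the subject; the space becomes %20
     subject = urllib.parse.quote("Hello John")
     print("mailto:john@company.com?subject=" + subject)
     # prints: mailto:john@company.com?subject=Hello%20John
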
Q: Does LinkChecker have JavaScript support?
A: No, and it never will. JavaScript sucks. If your page does not work
   without JS, then your web design is broken.
   Learn PHP or Zope or ASP, and use JavaScript only as an add-on for
   your web pages.

Q: I have a pretty large site to check. How can I restrict link checking
   to my own pages?
A: Look at the options --intern, --extern, --strict, --denyallow and
   --recursion-level. The next answer explains how they interact.

Q: I don't get this --extern/--intern stuff.
A: When it comes to checking, there are three types of URLs:
   1) strict URLs: we do only syntax checking
   2) extern URLs: like 1), but we additionally check whether they are
      valid by connect()ing to them
   3) intern URLs: like 2), but we additionally check whether they are
      HTML pages, and if so, we descend recursively into the link and
      check all the links in the HTML content.
   The --recursion-level option restricts the depth of such recursive
   descents.

   LinkChecker provides four options which determine into which of those
   three categories a URL falls: --intern, --extern, --strict and
   --denyallow.
   By default all URLs are intern. With --extern you specify which URLs
   are extern; with --intern you specify which URLs are intern.
   Now imagine you give both --extern and --intern. What happens when a
   URL matches both patterns? Or when it matches none? In this situation
   the --denyallow option specifies the order in which we match the URL.
   By default the order is intern/extern; with --denyallow it is
   extern/intern. Either way, the first match counts, and if none
   matches, the last checked category is the category for the URL.
   Finally, with --strict all extern URLs become strict.

   Oh, and just to boggle your mind: you can have more than one extern
   regular expression in a config file, and for each of those expressions
   you can specify whether the matched extern URLs should be strict or
   not.

   An example: assume we want to check only URLs of our two domains
   'mydomain.com' and 'myotherdomain.com'. Then we specify
   -i'^http://my(other)?domain\.com' as the intern regular expression;
   all other URLs are treated as extern. Easy.
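
   To see how that pattern classifies URLs, here is a small sketch in
   plain Python (LinkChecker's actual matching code may differ in
   detail):

     import re

     # the intern pattern from the example above
     intern = re.compile(r'^http://my(other)?domain\.com')

     for url in ('http://mydomain.com/index.html',
                 'http://myotherdomain.com/about.html',
                 'http://elsewhere.org/'):
         kind = 'intern' if intern.match(url) else 'extern'
         print(url, '->', kind)
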
Q: Are cookies insecure?
A: Cookies cannot store more information than is in the HTTP request
   itself, so you are not giving away any additional system information.
   Once stored, however, cookies are sent back to the server on
   subsequent requests. Not to every server, but only to the one the
   cookie originated from! This can be used to "track" subsequent
   requests to that server, and this is what annoys some people
   (including me).
   LinkChecker stores cookies only in memory. After LinkChecker
   finishes, they are lost, so any tracking is restricted to the
   checking time.

Q: I want to have my own logging class. How can I use it in LinkChecker?
A: Currently, only the Python API lets you define new logging classes.
   Define your own logging class as a subclass of StandardLogger or of
   any other logging class in the log module.
   Then call the addLogger function of Config.Configuration to register
   your new logger.
   After this, append a new logger instance to the fileoutput.
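
   For example, a hypothetical skeleton for the subclass (the module path
   and the methods to override are assumptions; check StandardLogger in
   the log module for the actual interface):

     # MyLogger.py -- the import path below is an assumption
     from linkcheck.log import StandardLogger

     class MyLogger(StandardLogger):
         # override the StandardLogger output methods you need here
         pass

   Registering and activating it then looks like this:
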
     import linkcheck
     import MyLogger

     log_format = 'mylog'
     log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
     cfg = linkcheck.Config.Configuration()
     # register the new logger class under the 'mylog' format name
     cfg.addLogger(log_format, MyLogger.MyLogger)
     # create an instance and append it to the file output loggers
     cfg['fileoutput'].append(cfg.newLogger(log_format, log_args))