Q1: LinkChecker produced an error, but my web page is ok with
    Netscape/IE/Opera/... Is this a bug in LinkChecker?

A1: Please check your web pages first. Are they really ok? Use a
    syntax-highlighting editor, and run them through HTML Tidy from
    www.w3c.org. Check whether the web server accepts HEAD requests
    as well, and check whether you are using a proxy that produces
    the error.

Q2.1: I still get an error, but the page is definitely ok.

Q2.2: I get an error "too much redirections", but the URL displays
      ok in my browser.

A2: The difference between LinkChecker and web browsers is the type
    of HTTP request: LinkChecker makes a HEAD request, while browsers
    make a GET request for the same URL. A lot of servers are broken
    when it comes to HEAD support. LinkChecker tries to detect some
    of them, but cannot catch them all. Nevertheless, this is an
    error. If the URL is on your own web server, make sure that your
    server handles HEAD requests!

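To see whether broken HEAD support is the cause, you can compare the
status codes a server returns for HEAD and GET yourself. This is a
minimal diagnostic sketch using only the Python standard library (not
part of LinkChecker); the host name at the bottom is a placeholder to
replace with your own server.

```python
# Compare a server's response status for HEAD vs. GET on the same URL.
# Differing status codes usually indicate broken HEAD support.
import http.client

def compare_head_and_get(host, path="/"):
    """Return a dict mapping 'HEAD' and 'GET' to the HTTP status code."""
    results = {}
    for method in ("HEAD", "GET"):
        conn = http.client.HTTPConnection(host, timeout=10)
        conn.request(method, path)
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        results[method] = response.status
        conn.close()
    return results

if __name__ == "__main__":
    # "www.example.com" is a placeholder host, not from the FAQ itself.
    print(compare_head_and_get("www.example.com"))
```

If the two status codes differ, the server answers HEAD differently
from GET, which is exactly the situation described above.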
Q3: How can I tell LinkChecker which proxy to use?

A3: LinkChecker works transparently with proxies. In a Unix or
    Windows environment, set the http_proxy, https_proxy, ftp_proxy
    or gopher_proxy environment variables to a URL that identifies
    the proxy server before starting LinkChecker. For example:

      # http_proxy="http://www.someproxy.com:3128"
      # export http_proxy

    In a Macintosh environment, LinkChecker retrieves the proxy
    information from Internet Config.

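To verify that the variable is visible to Python (the language
LinkChecker is written in), you can ask the standard library which
proxies it picks up from the environment. A small sketch; the proxy
URL is the example value from above, not a real server:

```python
# Check which proxy settings Python's standard library reads from
# the environment.
import os
from urllib.request import getproxies

# Example value from the FAQ above; replace with your real proxy URL.
os.environ["http_proxy"] = "http://www.someproxy.com:3128"

print(getproxies().get("http"))  # -> http://www.someproxy.com:3128
```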
|
Q4: The link "mailto:john@company.com?subject=Hello John" is
    reported as an error.

A4: You have to quote special characters (e.g. spaces) in the
    subject field. The correct link should be
    "mailto:...?subject=Hello%20John". Unfortunately, browsers like
    IE and Netscape do not enforce this.

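Python's standard library can do this quoting for you. A small
sketch, using the address from the question above:

```python
# Percent-encode the subject of a mailto link so that special
# characters such as spaces become %XX escapes.
from urllib.parse import quote

subject = "Hello John"
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)  # -> mailto:john@company.com?subject=Hello%20John
```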
|
Q5: Does LinkChecker have JavaScript support?

A5: No, and it never will. JavaScript sucks. If your page does not
    work without JS, then your web design is broken. Learn PHP or
    Zope or ASP, and use JavaScript only as an add-on for your web
    pages.

Q6: I have a pretty large site to check. How can I restrict link
    checking to only my own pages?

A6: Look at the options --intern, --extern, --strict, --denyallow
    and --recursion-level.

Q7: I don't get this --extern/--intern stuff.

A7: When it comes to checking, there are three types of URLs:

    1) strict extern URLs:
       We do only syntax checking. Intern URLs are never strict.
    2) extern URLs:
       Like 1), but we additionally check whether they are valid by
       connect()ing to them.
    3) intern URLs:
       Like 2), but we additionally check whether they are HTML
       pages and if so, we descend recursively into the link and
       check all the links in the HTML content.

    The --recursion-level option restricts the number of such
    recursive descents.

    LinkChecker provides four options which affect which of those
    three categories a URL falls into: --intern, --extern, --strict
    and --denyallow.

    By default all URLs are intern. With --extern you specify which
    URLs are extern; with --intern you specify which URLs are
    intern. Now imagine you give both --extern and --intern. What
    happens when a URL matches both patterns? Or when it matches
    none? In this situation the --denyallow option specifies the
    order in which we match the URL. By default the order is
    intern/extern; with --denyallow it is extern/intern. Either way,
    the first match counts, and if none matches, the last checked
    category is the category of the URL.

    Finally, with --strict all extern URLs are strict.

    Oh, and just to boggle your mind: you can have more than one
    extern regular expression in a config file, and for each of
    those expressions you can specify whether the matched extern
    URLs should be strict or not.

    An example. Assume we want to check only URLs of our domains
    named 'mydomain.com' and 'myotherdomain.com'. Then we specify
    -i'^http://my(other)?domain\.com' as the intern regular
    expression; all other URLs are treated as extern. Easy.

    Another example. We don't want to check mailto URLs. Then it's
    -i'!^mailto:'. The '!' negates an expression. With --strict, we
    don't even connect to any mail hosts.

    Yet another example. We check our site www.mycompany.com, don't
    want to recurse into extern links pointing outside our site, and
    want to ignore links to hollowood.com and hullabulla.com
    completely. This can only be done with a configuration entry
    like

      [filtering]
      extern1=hollowood.com 1
      extern2=hullabulla.com 1
      # the 1 means strict extern, i.e. don't even connect

    and the command

      linkchecker --intern=www.mycompany.com www.mycompany.com

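The matching order described above can be sketched in a few lines of
Python. This is only an illustration of the rules, not LinkChecker's
actual implementation; the pattern lists reuse the examples from this
answer.

```python
# Illustrative sketch of the --intern/--extern/--denyallow matching
# rules: patterns are tried in a configurable order, the first match
# wins, and if nothing matches the last checked category is used.
import re

def categorize(url, intern_patterns, extern_patterns, denyallow=False):
    """Return 'intern' or 'extern' for a URL.

    denyallow=False checks intern patterns first (the default order);
    denyallow=True checks extern patterns first.
    """
    if denyallow:
        order = [("extern", extern_patterns), ("intern", intern_patterns)]
    else:
        order = [("intern", intern_patterns), ("extern", extern_patterns)]
    for name, patterns in order:
        if any(re.search(pattern, url) for pattern in patterns):
            return name
    return order[-1][0]  # no match: last checked category wins

intern = [r"^http://my(other)?domain\.com"]
extern = [r"^mailto:"]
print(categorize("http://mydomain.com/index.html", intern, extern))  # intern
print(categorize("mailto:john@company.com", intern, extern))         # extern
print(categorize("http://elsewhere.com/", intern, extern))           # extern
```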
|
Q8: Are Cookies insecure?

A8: Cookies cannot store more information than is in the HTTP
    request itself, so you are not giving away any additional system
    information. After being stored, however, the cookies are sent
    back to the server on request. Not to every server, but only to
    the one the cookie originated from! This can be used to "track"
    subsequent requests to that server, and this is what annoys some
    people (including me). LinkChecker stores cookies only in
    memory; after LinkChecker finishes, they are lost. So the
    tracking is restricted to the checking time.

Q9: I want to have my own logging class. How can I use it in
    LinkChecker?

A9: Currently, only the Python API lets you define new logging
    classes. Define your own logging class as a subclass of
    StandardLogger or any other logging class in the log module.
    Then call the addLogger function of Config.Configuration to
    register your new logger, and append a new logger instance to
    the fileoutput:

      import linkcheck, MyLogger

      log_format = 'mylog'
      log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
      cfg = linkcheck.Config.Configuration()
      cfg.addLogger(log_format, MyLogger.MyLogger)
      cfg['fileoutput'].append(cfg.newLogger(log_format, log_args))

|
Q10.1: LinkChecker does not ignore anchor references on caching.

Q10.2: Some links with anchors are getting checked twice.

A10: This is not a bug. It is common practice to believe that if a
     URL ABC#anchor1 works, then ABC#anchor2 works too. That is not
     specified anywhere, and I have seen server-side scripts that
     fail on some anchors and not on others. This is the reason for
     always checking URLs with different anchors. If you really want
     to disable this behaviour, use --no-anchor-caching.