Q1: LinkChecker produced an error, but my web page is ok with
Netscape/IE/Opera/... Is this a bug in LinkChecker?

A1: Please check your web pages first. Are they really ok? Use a
syntax-highlighting editor, and run them through HTML Tidy from
www.w3c.org. Also check whether a proxy between you and the server
is producing the error.


Q2: I still get an error, but the page is definitely ok.

A2: Some servers deny access to automated tools (also called
robots) like LinkChecker. This is not a bug in LinkChecker but
rather a policy of the webmaster running the website you are
checking. A website may even send robots different pages than it
sends to normal browsers.


Q3: How can I tell LinkChecker which proxy to use?

A3: LinkChecker works transparently with proxies. In a Unix or
Windows environment, set the http_proxy, https_proxy, ftp_proxy or
gopher_proxy environment variables to a URL that identifies the
proxy server before starting LinkChecker. For example, in a Bourne
shell:

  $ http_proxy="http://www.someproxy.com:3128"
  $ export http_proxy

In a Macintosh environment, LinkChecker will retrieve proxy
information from Internet Config.

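As an illustrative sketch (not part of LinkChecker itself), Python's
standard library shows how such environment variables are picked up
transparently; the proxy URL below is the example value from above:

```python
import os
from urllib.request import getproxies

# Set the variable as in the shell example above (normally you do
# this in the shell before starting LinkChecker).
os.environ["http_proxy"] = "http://www.someproxy.com:3128"

# urllib reads the *_proxy environment variables transparently,
# which is why LinkChecker needs no extra proxy configuration.
print(getproxies().get("http"))
```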
Q4: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.

A4: You have to quote special characters (e.g. spaces) in the
subject field. The correct link is "mailto:...?subject=Hello%20John".
Unfortunately, browsers like IE and Netscape do not enforce this.

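A quick way to produce the quoted form is Python's urllib (a sketch
for illustration; LinkChecker does not rewrite the link for you):

```python
from urllib.parse import quote

subject = "Hello John"
# Percent-encode special characters; the space becomes %20.
link = "mailto:john@company.com?subject=" + quote(subject)
print(link)  # mailto:john@company.com?subject=Hello%20John
```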
Q5: Does LinkChecker support JavaScript?

A5: No, and it never will. If your page does not work without
JavaScript, your web design is broken. Use PHP, Zope or ASP for
dynamic content, and use JavaScript only as an addition to your
web pages.


Q6: I have a pretty large site to check. How can I restrict link
checking to my own pages only?

A6: Look at the options --intern, --extern, --strict, --denyallow
and --recursion-level.


Q7: I don't get this --extern/--intern stuff.

A7: When it comes to checking, there are three types of URLs:

1) strict external URLs:
   Only syntax checking is done. Internal URLs are never strict.
2) external URLs:
   Like 1), but we additionally check that they are valid by
   connect()ing to them.
3) internal URLs:
   Like 2), but we additionally check whether they are HTML pages,
   and if so, we descend recursively into the link and check all
   links in the HTML content.

The --recursion-level option restricts the number of such recursive
descents.

LinkChecker provides four options that cause URLs to fall into one
of these three categories: --intern, --extern, --strict and
--denyallow.

By default, all URLs are internal. With --extern you specify which
URLs are external; with --intern you specify which URLs are
internal. Now imagine you use both --extern and --intern. What
happens when a URL matches both patterns? Or neither? In this
situation the --denyallow option specifies the order in which the
URL is matched. By default the order is internal/external; with
--denyallow it is external/internal. Either way, the first match
counts, and if nothing matches, the last checked category is the
category for the URL.

Finally, with --strict all external URLs are strict.

Oh, and just to boggle your mind: you can have more than one
external regular expression in a config file, and for each of those
expressions you can specify whether the matched external URLs
should be strict or not.

An example: assume we want to check only URLs of our domains
mydomain.com and myotherdomain.com. Then we specify
-i'^http://my(other)?domain\.com' as the internal regular
expression; all other URLs are treated as external. Easy.

Another example: we don't want to check mailto URLs. Then it's
-i'!^mailto:'. The '!' negates an expression. With --strict, we
don't even connect to any mail hosts.

Yet another example: we check our site www.mycompany.com, don't
want to recurse into external links pointing outside our site, and
want to ignore links to hollowood.com and hullabulla.com
completely. This can only be done with a configuration entry like

[filtering]
extern1=hollowood.com 1
extern2=hullabulla.com 1
# the 1 means strict external, i.e. don't even connect

and the command

linkchecker --intern=www.mycompany.com www.mycompany.com


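The matching order described above can be sketched in Python. This
is a simplified model for illustration only, not LinkChecker's
actual implementation:

```python
import re

def classify(url, intern, extern, denyallow=False):
    """Return 'internal' or 'external' for a URL.

    Simplified model of the matching order: by default internal
    patterns are tried first; with denyallow the order is reversed.
    The first match counts, and if nothing matches, the last
    checked category is the category for the URL.
    """
    checks = [("internal", intern), ("external", extern)]
    if denyallow:
        checks.reverse()
    category = None
    for name, patterns in checks:
        category = name  # remember the last checked category
        if any(re.search(p, url) for p in patterns):
            return name
    return category

# Only our domains are internal; everything else falls through to
# external (the last checked category).
intern = [r"^http://my(other)?domain\.com"]
print(classify("http://mydomain.com/a.html", intern, []))   # internal
print(classify("http://elsewhere.com/b.html", intern, []))  # external
```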
Q8: Is LinkChecker's cookie feature insecure?

A8: Cookies cannot store more information than is in the HTTP
request itself, so you are not giving away any additional system
information. After being stored, however, the cookies are sent back
to the server on request. Not to every server, only to the one the
cookie originated from! This can be used to "track" subsequent
requests to that server, which is what annoys some people
(including me).

Cookies are only stored in memory. After LinkChecker finishes,
they are lost, so the tracking is restricted to the checking time.
The cookie feature is disabled by default.


Q9: I want to have my own logging class. How can I use it in
LinkChecker?

A9: Currently, only the Python API lets you define new logging
classes. Define your own logging class as a subclass of
StandardLogger or any other logging class in the log module.
Then call the addLogger function of Config.Configuration to
register your new logger. After this, append a new logger instance
to the fileoutput:

import linkcheck, MyLogger

# register the custom logger class under the name 'mylog'
log_format = 'mylog'
log_args = {'fileoutput': log_format, 'filename': 'foo.txt'}
cfg = linkcheck.Config.Configuration()
cfg.addLogger(log_format, MyLogger.MyLogger)
# append an instance of the new logger to the file output list
cfg['fileoutput'].append(cfg.newLogger(log_format, log_args))


Q10.1: LinkChecker does not ignore anchor references when caching.

Q10.2: Some links with anchors are getting checked twice.

A10: This is not a bug. It is common practice to believe that if a
URL ABC#anchor1 works, then ABC#anchor2 works too. That is not
specified anywhere, and I have seen server-side scripts that fail
on some anchors and not on others. This is the reason for always
checking URLs with different anchors. If you really want to
disable this, use --no-anchor-caching.

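To see that such URLs differ only in their fragment, you can split
them with Python's urllib (illustration only):

```python
from urllib.parse import urldefrag

url1, frag1 = urldefrag("http://example.com/page.html#anchor1")
url2, frag2 = urldefrag("http://example.com/page.html#anchor2")

# Same document, different anchors -- LinkChecker still checks
# both, because the server may treat the anchors differently.
print(url1 == url2)   # True
print(frag1, frag2)   # anchor1 anchor2
```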
Q11: I see LinkChecker fetches a "/robots.txt" file for every site
it checks. What is that about?

A11: LinkChecker follows the robots.txt exclusion standard. To
prevent misuse of LinkChecker, you cannot turn this feature off.
See http://www.robotstxt.org/wc/robots.html and
http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
for more info.

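The exclusion standard can be tried out with the robots.txt parser
in Python's standard library (a sketch; LinkChecker's own parser
may differ in detail):

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt that bars all robots from /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A well-behaved robot asks before fetching each URL.
print(rp.can_fetch("LinkChecker", "http://example.com/private/a.html"))  # False
print(rp.can_fetch("LinkChecker", "http://example.com/public/a.html"))   # True
```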
Q12: Ctrl-C does not stop LinkChecker immediately. Why is that?

A12: The Python interpreter has to wait for all threads to finish,
and this means waiting for all open sockets to close. The default
timeout for sockets is 30 seconds, hence the delay. You can change
the default socket timeout with the --timeout option.