git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@978 e7d03fd6-7b0d-0410-9947-9c21f3af8025
calvin 2003-08-05 13:01:06 +00:00
parent 6af1bee069
commit 0500d85c74
4 changed files with 44 additions and 38 deletions


@@ -1,10 +1,13 @@
1.8.22
-* allow colons in HTML attribute names for namespaces
+* allow colons in HTML attribute names, used for namespaces
Changed: linkcheck/parser/htmllex.l
-* fix match of intern patters with --denyallow enabled
+* fix match of intern patterns with --denyallow enabled
Changed: linkcheck/UrlData.py
+* s/intern/internal/ and s/extern/external/ in the documentation
+Changed: linkchecker, linkchecker.1, FAQ
* rename column "column" to "col" in SQL output, since "column" is
a reserved keyword. Thanks Garvin Hicking for the hint.
Changed: linkcheck/log/SQLLogger.py, create.sql
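
The renamed column matters wherever the SQL log output is fed into a
database. A minimal sketch, assuming an output type named "sql" and an
existing database "linksdb" (both names are assumptions; see
linkcheck/log/SQLLogger.py and create.sql for the real schema):

# the emitted SQL now uses "col" instead of the reserved word "column"
linkchecker -osql http://www.example.com/ | mysql linksdb
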
1.8.21
* detect recursive redirections; the maximum of five redirections is

FAQ

@@ -4,7 +4,7 @@ Q1: LinkChecker produced an error, but my web page is ok with
A1: Please check your web pages first. Are they really ok? Use
a syntax highlighting editor! Use HTML Tidy from www.w3c.org!
Check if the web server is accepting HEAD requests as well.
-Check if you are using a Proxy which produces the error.
+Check if you are using a proxy which produces the error.
Q2.1: I still get an error, but the page is definitely ok.
@@ -38,10 +38,10 @@ A4: You have to quote special characters (e.g. spaces) in the subject field.
Q5: Does LinkChecker have JavaScript support?
-A5: No, it never will. JavaScript sucks. If your page is not
-working without JS then your web design is broken.
-Learn PHP or Zope or ASP, and use JavaScript just as an addon for your
-web pages.
+A5: No, it never will. If your page is not working without JS then your
+web design is broken.
+Use PHP or Zope or ASP for dynamic content, and use JavaScript just as
+an addon for your web pages.
Q6: I have a pretty large site to check. How can I restrict link checking
@@ -52,12 +52,12 @@ A6: Look at the options --intern, --extern, --strict, --denyallow and
Q7: I don't get this --extern/--intern stuff.
A7: When it comes to checking there are three types of URLs:
-1) strict extern URLs:
-We do only syntax checking. Intern URLs are never strict.
-2) extern URLs:
+1) strict external URLs:
+We do only syntax checking. Internal URLs are never strict.
+2) external URLs:
Like 1), but we additionally check if they are valid by connect()ing
to them
-3) intern URLs:
+3) internal URLs:
Like 2), but we additionally check if they are HTML pages and if so,
we descend recursively into this link and check all the links in the
HTML content.
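
A hypothetical invocation that exercises all three categories
(www.example.com is a placeholder):

# urls under www.example.com are internal and recursed into; all other
# urls are external; --strict turns external checks into syntax checks only
linkchecker --strict -i'^http://www\.example\.com' http://www.example.com/
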
@@ -67,42 +67,42 @@ A7: When it comes to checking there are three types of URLs:
LinkChecker provides four options which affect URLs to fall in one
of those three categories: --intern, --extern, --strict and
--denyallow.
-By default all URLs are intern. With --extern you specify what URLs
-are extern. With --intern you specify what URLs are intern.
+By default all URLs are internal. With --extern you specify what URLs
+are external. With --intern you specify what URLs are internal.
Now imagine you have both --extern and --intern. What happens
when an URL matches both patterns? Or when it matches none? In this
situation the --denyallow option specifies the order in which we match
-the URL. By default it is intern/extern, with --denyallow the order is
-extern/intern. Either way, the first match counts, and if none matches,
+the URL. By default it is internal/external, with --denyallow the order is
+external/internal. Either way, the first match counts, and if none matches,
the last checked category is the category for the URL.
-Finally, with --strict all extern URLs are strict.
+Finally, with --strict all external URLs are strict.
-Oh, and just to boggle your mind: you can have more than one extern
+Oh, and just to boggle your mind: you can have more than one external
regular expression in a config file and for each of those expressions
-you can specify if those matched extern URLs should be strict or not.
+you can specify if those matched external URLs should be strict or not.
An example. Assume we want to check only urls of our domains named
'mydomain.com' and 'myotherdomain.com'. Then we specify
--i'^http://my(other)?domain\.com' as intern regular expression, all other
-urls are treated extern. Easy.
+-i'^http://my(other)?domain\.com' as internal regular expression, all other
+urls are treated external. Easy.
Another example. We don't want to check mailto urls. Then it's
-i'!^mailto:'. The '!' negates an expression. With --strict, we don't
even connect to any mail hosts.
Yet another example. We check our site www.mycompany.com, don't recurse
-into extern links point outside from our site and want to ignore links to
+into external links pointing outside our site and want to ignore links to
hollowood.com and hullabulla.com completely.
This can only be done with a configuration entry like
[filtering]
extern1=hollowood.com 1
extern2=hullabulla.com 1
-# the 1 means strict extern ie don't even connect
+# the 1 means strict external ie don't even connect
and the command
linkchecker --intern=www.mycompany.com www.mycompany.com
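
To see the matching order in action, a hedged sketch reusing the domain
above (the patterns are made up):

# the start url matches both patterns; by default the internal pattern
# is tried first, so checking recurses as usual; with --denyallow the
# external pattern wins and those pages are only connected to, not parsed
linkchecker --denyallow -i'mycompany\.com' -e'mycompany\.com/ads' \
    http://www.mycompany.com/
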
-Q8: Are Cookies insecure?
+Q8: Is LinkChecker's cookie feature insecure?
A8: Cookies cannot store more information than is in the HTTP request itself,
so you are not giving away any more system information.
After storing, however, the cookies are sent out to the server on request.
@@ -111,6 +111,7 @@ A8: Cookies cannot store more information than is in the HTTP request itself,
and this is what annoys some people (including me).
Cookies are only stored in memory. After LinkChecker finishes, they
are lost. So the tracking is restricted to the checking time.
+The cookie feature is disabled by default.
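
A minimal sketch of turning the feature on for one run (placeholder URL):

# -C/--cookies enables cookie storage for this invocation only
linkchecker -C http://www.example.com/
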
Q9: I want to have my own logging class. How can I use it in LinkChecker?
@@ -139,8 +140,10 @@ A10: This is not a bug.
If you really want to disable this, use --no-anchor-caching.
-Q11: I see linkchecker gets a "/robots.txt" file for every site it
+Q11: I see LinkChecker gets a "/robots.txt" file for every site it
checks. What is that about?
-A11: See http://www.robotstxt.org/wc/robots.html and
-http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
-for more info.
+A11: LinkChecker follows the robots.txt exclusion standard. To avoid
+misuse of LinkChecker, you cannot turn this feature off.
+See http://www.robotstxt.org/wc/robots.html and
+http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
+for more info.
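
For reference, a hypothetical robots.txt excerpt that LinkChecker, like any
conforming robot, obeys:

# keep all robots, LinkChecker included, out of /private
User-agent: *
Disallow: /private/
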


@@ -1,7 +1,7 @@
On Debian systems, you have a simple CGI script located at
http://localhost/lconline/index.html
For this to work, your web server must have content negotiation enabled.
-Or you have to remove for one language XY the file extensions of the .html
-files.
+Or, for one language XY, you have to remove the .XY file extension from
+the .html.XY files.
For installation of a FastCGI script instead of the above, see README.
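
On Apache, content negotiation can be switched on with MultiViews; a sketch,
assuming the pages live under the lconline directory mentioned above:

# in the Apache configuration or an .htaccess file for that directory
Options +MultiViews
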


@@ -30,8 +30,8 @@ For single-letter option arguments the space is not a necessity.
So \fI-o colored\fP is the same as \fI-ocolored\fP.
.TP
\fB-a\fP, \fB--anchors\fP
-Check HTTP anchor references. This option applies to both intern
-and extern urls. Default is don't check anchors.
+Check HTTP anchor references. This option applies to both internal
+and external urls. Default is don't check anchors.
This option implies -w because anchor errors are always warnings.
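
A usage sketch (placeholder URL):

# also check #fragment anchors; warnings are implied as stated above
linkchecker -a http://www.example.com/
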
.TP
\fB-C\fP, \fB--cookies\fP
@@ -41,16 +41,16 @@ Sent and accepted cookies are provided as additional logging
information.
.TP
\fB-d\fP, \fB--denyallow\fP
-Swap checking order to extern/intern. Default checking order is
-intern/extern.
+Swap checking order to external/internal. Default checking order is
+internal/external.
.TP
\fB-D\fP, \fB--debug\fP
Print debugging information. Provide this option multiple times
for even more debugging information.
.TP
\fB-e \fIregex\fP, \fB--extern=\fIregex\fP
-Assume urls that match the given regular expression as extern.
-Only intern HTML links are checked recursively.
+Treat urls that match the given regular expression as external.
+Only internal HTML links are checked recursively.
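
For example (pattern and URL are placeholders):

# everything under ads.example.com is external: checked, not recursed into
linkchecker -e'^http://ads\.example\.com' http://www.example.com/
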
.TP
\fB-f \fIfile\fP, \fB--config=\fIfile\fP
Use \fIfile\fP as configuration file. As default LinkChecker first searches
@@ -67,8 +67,8 @@ output.
Ask for url if none are given on the commandline.
.TP
\fB-i \fIregex\fP, \fB--intern=\fIregex\fP
-Assume URLs that match the given regular expression as intern.
-LinkChecker descends recursively only to intern URLs, not to extern.
+Treat URLs that match the given regular expression as internal.
+LinkChecker descends recursively only into internal URLs, not external ones.
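
A sketch using the negation syntax from the FAQ (placeholder URL):

# the '!' negates the pattern, so mailto urls are not treated as internal
linkchecker -i'!^mailto:' http://www.example.com/
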
.TP
\fB-h\fP, \fB--help\fP
Help me! Print usage information for this program.
@@ -106,7 +106,7 @@ A negative depth will enable infinite recursion.
Default depth is 1.
.TP
\fB-s\fP, \fB--strict\fP
-Check only the syntax of extern links, do not try to connect to them.
+Check only the syntax of external links, do not try to connect to them.
For local file urls, only local files are internal. For
http and ftp urls, all urls at the same domain name are internal.
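
Combined with --extern this avoids connecting to mail hosts entirely, as the
FAQ notes (placeholder URL):

# mailto urls become strict external: syntax-checked, never connected to
linkchecker -s -e'^mailto:' http://www.example.com/
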
.TP