mirror of https://github.com/Hopiu/linkchecker.git, synced 2026-05-16 10:33:09 +00:00

commit 0500d85c74 (parent 6af1bee069)
commit message: updated

git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@978 e7d03fd6-7b0d-0410-9947-9c21f3af8025

4 changed files with 44 additions and 38 deletions
@@ -1,10 +1,13 @@
 1.8.22
-* allow colons in HTML attribute names for namespaces
+* allow colons in HTML attribute names, used for namespaces
   Changed: linkcheck/parser/htmllex.l
-* fix match of intern patters with --denyallow enabled
+* fix match of intern patterns with --denyallow enabled
   Changed: linkcheck/UrlData.py
+* s/intern/internal/ and s/extern/external/ in the documentation
+  Changed: linkchecker, linkchecker.1, FAQ
+* rename column "column" to "col" in SQL output, since "column" is
+  a reserved keyword. Thanks Garvin Hicking for the hint.
+  Changed: linkcheck/log/SQLLogger.py, create.sql

 1.8.21
 * detect recursive redirections; the maximum of five redirections is
53
FAQ
@@ -4,7 +4,7 @@ Q1: LinkChecker produced an error, but my web page is ok with
 A1: Please check your web pages first. Are they really ok? Use
     a syntax highlighting editor! Use HTML Tidy from www.w3c.org!
     Check if the web server is accepting HEAD requests as well.
-    Check if you are using a Proxy which produces the error.
+    Check if you are using a proxy which produces the error.

 Q2.1: I still get an error, but the page is definitely ok.
@@ -38,10 +38,10 @@ A4: You have to quote special characters (e.g. spaces) in the subject field.

 Q5: Has LinkChecker JavaScript support?
-A5: No, it never will. JavaScript sucks. If your page is not
-    working without JS then your web design is broken.
-    Learn PHP or Zope or ASP, and use JavaScript just as an addon for your
-    web pages.
+A5: No, it never will. If your page is not working without JS then your
+    web design is broken.
+    Use PHP or Zope or ASP for dynamic content, and use JavaScript just as
+    an addon for your web pages.

 Q6: I have a pretty large site to check. How can I restrict link checking
@@ -52,12 +52,12 @@ A6: Look at the options --intern, --extern, --strict, --denyallow and

 Q7: I don't get this --extern/--intern stuff.
 A7: When it comes to checking there are three types of URLs:
-    1) strict extern URLs:
-       We do only syntax checking. Intern URLs are never strict.
-    2) extern URLs:
+    1) strict external URLs:
+       We do only syntax checking. Internal URLs are never strict.
+    2) external URLs:
        Like 1), but we additionally check if they are valid by connect()ing
        to them
-    3) intern URLs:
+    3) internal URLs:
        Like 2), but we additionally check if they are HTML pages and if so,
        we descend recursively into this link and check all the links in the
        HTML content.
@@ -67,42 +67,42 @@ A7: When it comes to checking there are three types of URLs:
     LinkChecker provides four options which affect URLs to fall in one
     of those three categories: --intern, --extern, --strict and
     --denyallow.
-    By default all URLs are intern. With --extern you specify what URLs
-    are extern. With --intern you specify what URLs are intern.
+    By default all URLs are internal. With --extern you specify what URLs
+    are external. With --intern you specify what URLs are internal.
     Now imagine you have both --extern and --intern. What happens
     when an URL matches both patterns? Or when it matches none? In this
     situation the --denyallow option specifies the order in which we match
-    the URL. By default it is intern/extern, with --denyallow the order is
-    extern/intern. Either way, the first match counts, and if none matches,
+    the URL. By default it is internal/external, with --denyallow the order is
+    external/internal. Either way, the first match counts, and if none matches,
     the last checked category is the category for the URL.
-    Finally, with --strict all extern URLs are strict.
+    Finally, with --strict all external URLs are strict.

-    Oh, and just to boggle your mind: you can have more than one extern
+    Oh, and just to boggle your mind: you can have more than one external
     regular expression in a config file and for each of those expressions
-    you can specify if those matched extern URLs should be strict or not.
+    you can specify if those matched external URLs should be strict or not.

     An example. Assume we want to check only urls of our domains named
     'mydomain.com' and 'myotherdomain.com'. Then we specify
-    -i'^http://my(other)?domain\.com' as intern regular expression, all other
-    urls are treated extern. Easy.
+    -i'^http://my(other)?domain\.com' as internal regular expression, all other
+    urls are treated external. Easy.

     Another example. We don't want to check mailto urls. Then its
     -i'!^mailto:'. The '!' negates an expression. With --strict, we don't
     even connect to any mail hosts.

     Yet another example. We check our site www.mycompany.com, don't recurse
-    into extern links point outside from our site and want to ignore links to
+    into external links point outside from our site and want to ignore links to
     hollowood.com and hullabulla.com completely.
     This can only be done with a configuration entry like
     [filtering]
     extern1=hollowood.com 1
     extern2=hullabulla.com 1
-    # the 1 means strict extern ie don't even connect
+    # the 1 means strict external ie don't even connect
     and the command
     linkchecker --intern=www.mycompany.com www.mycompany.com

-Q8: Are Cookies insecure?
+Q8: Is LinkCheckers cookie feature insecure?
 A8: Cookies can not store more information as is in the HTTP request itself,
     so you are not giving away any more system information.
     After storing however, the cookies are sent out to the server on request.
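As an illustration of the matching rules in A7 above, the following is a hypothetical sketch (not LinkChecker's actual code) of how a single --intern/--extern pattern with the '!' negation prefix could be evaluated against a URL:

```python
import re

# Hypothetical helper, not LinkChecker's implementation: a leading '!'
# negates an expression, as in the -i'!^mailto:' example above.
def matches(pattern, url):
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    found = re.search(pattern, url) is not None
    # A negated pattern matches exactly when the expression does not.
    return found != negate

# The internal pattern from the first example above:
intern = r"^http://my(other)?domain\.com"
print(matches(intern, "http://myotherdomain.com/index.html"))  # True
print(matches(intern, "http://hollowood.com/"))                # False
# The negated mailto pattern: mailto urls do not count as internal.
print(matches(r"!^mailto:", "mailto:webmaster@mydomain.com"))  # False
```

With --denyallow, the same kind of test would simply be run against the external patterns before the internal ones; the first match decides the category.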
@@ -111,6 +111,7 @@ A8: Cookies can not store more information as is in the HTTP request itself,
     and this is what some people annoys (including me).
     Cookies are only stored in memory. After LinkChecker finishes, they
     are lost. So the tracking is restricted to the checking time.
+    The cookie feature is disabled as default.

 Q9: I want to have my own logging class. How can I use it in LinkChecker?
@@ -139,8 +140,10 @@ A10: This is not a bug.
     If you really want to disable this, use --no-anchor-caching.

-Q11: I see linkchecker gets a "/robots.txt" file for every site it
+Q11: I see LinkChecker gets a "/robots.txt" file for every site it
      checks. What is that about?
-A11: See http://www.robotstxt.org/wc/robots.html and
-     http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
-     for more info.
+A11: LinkChecker follows the robots.txt exclusion standard. To avoid
+     misuse of LinkChecker, you cannot turn this feature off.
+     See http://www.robotstxt.org/wc/robots.html and
+     http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt
+     for more info.
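The exclusion standard that the new A11 text refers to can be seen in action with Python's standard urllib.robotparser module (this illustrates robots.txt semantics, not LinkChecker's own code; the rules and URLs are made up):

```python
from urllib.robotparser import RobotFileParser

# Parse a small robots.txt, fed in as lines rather than fetched from a
# site, then ask whether a robot may fetch two URLs.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("LinkChecker", "http://www.example.com/index.html"))  # True
print(rp.can_fetch("LinkChecker", "http://www.example.com/private/x"))   # False
```

A checker that honors these answers simply skips any URL for which can_fetch() returns False.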
4
debian/README.Debian
vendored
@@ -1,7 +1,7 @@
 On Debian systems, you have a simple CGI script located at
 http://localhost/lconline/index.html
 For this to work, your web server must have content negotiation enabled.
-Or you have to remove for one language XY the file extensions of the .html
-files.
+Or you have to remove for one language XY the file extensions of the
+.html.XY files.

 For installation of a FastCGI script instead of the above, see README.
linkchecker.1
@@ -30,8 +30,8 @@ For single-letter option arguments the space is not a necessity.
 So \fI-o colored\fP is the same as \fI-ocolored\fP.
 .TP
 \fB-a\fP, \fB--anchors\fP
-Check HTTP anchor references. This option applies to both intern
-and extern urls. Default is don't check anchors.
+Check HTTP anchor references. This option applies to both internal
+and external urls. Default is don't check anchors.
 This option implies -w because anchor errors are always warnings.
 .TP
 \fB-C\fP, \fB--cookies\fP
@@ -41,16 +41,16 @@ Sent and accepted cookies are provided as additional logging
 information.
 .TP
 \fB-d\fP, \fB--denyallow\fP
-Swap checking order to extern/intern. Default checking order is
-intern/extern.
+Swap checking order to external/internal. Default checking order is
+internal/external.
 .TP
 \fB-D\fP, \fB--debug\fP
 Print debugging information. Provide this option multiple times
 for even more debugging information.
 .TP
 \fB-e \fIregex\fP, \fB--extern=\fIregex\fP
-Assume urls that match the given regular expression as extern.
-Only intern HTML links are checked recursively.
+Assume urls that match the given regular expression as external.
+Only internal HTML links are checked recursively.
 .TP
 \fB-f \fIfile\fP, \fB--config=\fIfile\fP
 Use \fIfile\fP as configuration file. As default LinkChecker first searches
@@ -67,8 +67,8 @@ output.
 Ask for url if none are given on the commandline.
 .TP
 \fB-i \fIregex\fP, \fB--intern=\fIregex\fP
-Assume URLs that match the given regular expression as intern.
-LinkChecker descends recursively only to intern URLs, not to extern.
+Assume URLs that match the given regular expression as internal.
+LinkChecker descends recursively only to internal URLs, not to external.
 .TP
 \fB-h\fP, \fB--help\fP
 Help me! Print usage information for this program.
@@ -106,7 +106,7 @@ A negative depth will enable inifinite recursion.
 Default depth is 1.
 .TP
 \fB-s\fP, \fB--strict\fP
-Check only the syntax of extern links, do not try to connect to them.
+Check only the syntax of external links, do not try to connect to them.
 For local file urls, only local files are internal. For
 http and ftp urls, all urls at the same domain name are internal.
 .TP
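The --strict option above reduces external links to a syntax-only check, i.e. the URL is parsed but no connection is ever opened. A minimal sketch of such a check with Python's standard urllib.parse (a hypothetical helper, not the implementation behind --strict):

```python
from urllib.parse import urlsplit

# Hypothetical sketch of a syntax-only URL check: split the URL and
# require a scheme plus some host or path, without connecting anywhere.
def syntax_ok(url):
    parts = urlsplit(url)
    return bool(parts.scheme) and bool(parts.netloc or parts.path)

print(syntax_ok("http://www.mycompany.com/index.html"))  # True
print(syntax_ok("not a url"))                            # False
```

A non-strict external check would follow the same parse with an actual connect() attempt to the host.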