diff --git a/doc/html/Makefile b/doc/html/Makefile
index bddf2d01..1713faeb 100644
--- a/doc/html/Makefile
+++ b/doc/html/Makefile
@@ -6,7 +6,10 @@ all: $(HELPFILES)
 clean:
 	-rm -f *.qhc *.qch
 
-%.qhc: %.qhcp lcdoc.qhp
+%.html: %.txt html.header html.footer
+	(cat html.header; markdown2 $<; cat html.footer) > index.html
+
+%.qhc: %.qhcp lcdoc.qhp index.html
 	qcollectiongenerator $< -o $@
 
 favicon.ico: favicon32x32.png favicon16x16.png
diff --git a/doc/html/html.footer b/doc/html/html.footer
new file mode 100644
index 00000000..c136615c
--- /dev/null
+++ b/doc/html/html.footer
@@ -0,0 +1,5 @@
+[5 lines of closing HTML markup; tags lost in text extraction]
diff --git a/doc/html/html.header b/doc/html/html.header
new file mode 100644
index 00000000..879e109d
--- /dev/null
+++ b/doc/html/html.header
@@ -0,0 +1,24 @@
+[24 lines of opening HTML markup; tags lost in text extraction.
+ Surviving text: the page title "Check websites for broken links"
+ and the LinkChecker navigation bar]
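The new `%.html` rule above wraps the output of markdown2 between html.header and html.footer. A minimal sketch of that concatenation, shown in Python rather than make so it runs even where markdown2 is not installed; the literal `body` string stands in for markdown2's output:

```python
# Emulate the Makefile recipe:
#   (cat html.header; markdown2 $<; cat html.footer) > index.html
from pathlib import Path
import tempfile

workdir = Path(tempfile.mkdtemp())
(workdir / "html.header").write_text("<html><body>\n")
(workdir / "html.footer").write_text("</body></html>\n")
body = "<h1>Documentation</h1>\n"  # stand-in for markdown2 output

index = ((workdir / "html.header").read_text()
         + body
         + (workdir / "html.footer").read_text())
(workdir / "index.html").write_text(index)

print(index.splitlines()[0])  # prints: <html><body>
```

The recipe always writes to index.html rather than `$@`, so the pattern rule effectively only produces one page; that is faithful to the diff, not an error in this sketch.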
diff --git a/doc/html/index.html b/doc/html/index.html
index 4727c14e..4a0ec876 100644
--- a/doc/html/index.html
+++ b/doc/html/index.html
@@ -8,7 +8,6 @@ Check websites for broken links
-[one removed <head> markup line; tags lost in text extraction]
@@ -16,206 +15,149 @@ img { border: 0; }
LinkChecker

Performed checks

All URLs have to pass a preliminary syntax test. Minor quoting mistakes will issue a warning; all other invalid syntax issues are errors. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.

  • HTTP links (http:, https:)

    After connecting to the given HTTP server the given path or query is requested. All redirections are followed, and if user/password is given it will be used as authorization when necessary. Permanently moved pages issue a warning. All final HTTP status codes other than 2xx are errors.

  • Local files (file:)

    A regular, readable file that can be opened is valid. A readable directory is also valid. All other files, for example device files or unreadable and non-existing files, are errors.

    File contents are checked for recursion.

  • Mail links (mailto:)

    A mailto: link eventually resolves to a list of email addresses. If one address fails, the whole list will fail. For each mail address we check the following things:

    1. Check the address syntax, both of the part before and after the @ sign.
    2. Look up the MX DNS records. If we find no MX record, print an error.
    3. Check if one of the mail hosts accepts an SMTP connection. Check hosts with higher priority first. If no host accepts SMTP, we print a warning.
    4. Try to verify the address with the VRFY command. If we get an answer, print the verified address as an info.

  • FTP links (ftp:)

    For FTP links we do:

    1. connect to the specified host
    2. try to log in with the given user and password. The default user is anonymous, the default password is anonymous@.
    3. try to change to the given directory
    4. list the file with the NLST command

  • Telnet links (telnet:)

    We try to connect and, if user/password are given, log in to the given telnet server.

  • NNTP links (news:, snews:, nntp)

    We try to connect to the given NNTP server. If a news group or article is specified, try to request it from the server.

  • Ignored links (javascript:, etc.)

    An ignored link will only print a warning. No further checking will be made.

    Here is a complete list of recognized, but ignored links. The most prominent of them are JavaScript links.

      • acap: (application configuration access protocol)
      • afs: (Andrew File System global file names)
      • chrome: (Mozilla specific)
      • cid: (content identifier)
      • clsid: (Microsoft specific)
      • data: (data)
      • dav: (dav)
      • fax: (fax)
      • find: (Mozilla specific)
      • gopher: (Gopher)
      • imap: (internet message access protocol)
      • isbn: (ISBN (int. book numbers))
      • javascript: (JavaScript)
      • ldap: (Lightweight Directory Access Protocol)
      • mailserver: (Access to data available from mail servers)
      • mid: (message identifier)
      • mms: (multimedia stream)
      • modem: (modem)
      • nfs: (network file system protocol)
      • opaquelocktoken: (opaquelocktoken)
      • pop: (Post Office Protocol v3)
      • prospero: (Prospero Directory Service)
      • rsync: (rsync protocol)
      • rtsp: (real time streaming protocol)
      • service: (service location)
      • shttp: (secure HTTP)
      • sip: (session initiation protocol)
      • tel: (telephone)
      • tip: (Transaction Internet Protocol)
      • tn3270: (Interactive 3270 emulation sessions)
      • vemmi: (versatile multimedia interface)
      • wais: (Wide Area Information Servers)
      • z39.50r: (Z39.50 Retrieval)
      • z39.50s: (Z39.50 Session)

Recursion

Before descending recursively into a URL, it has to fulfill several conditions. They are checked in this order:

  1. A URL must be valid.
  2. A URL must be parseable. This currently includes HTML files, Opera bookmarks files, and directories. If a file type cannot be determined (for example, it does not have a common HTML file extension and the content does not look like HTML), it is assumed to be non-parseable.
  3. The URL content must be retrievable. This is usually the case, except for mailto: or unknown URL types, for example.
  4. The maximum recursion level must not be exceeded. It is configured with the --recursion-level option and is unlimited by default.
  5. It must not match the ignored URL list. This is controlled with the --ignore-url option.
  6. The Robots Exclusion Protocol must allow links in the URL to be followed recursively. This is checked by searching for a "nofollow" directive in the HTML header data.

Note that the directory recursion reads all files in that directory, not just a subset like index.htm*.

Frequently asked questions

Q: LinkChecker produced an error, but my web page is ok with Mozilla/IE/Opera/... Is this a bug in LinkChecker?

A: Please check your web pages first. Are they really ok? Use the --check-html option, or check whether you are using a proxy that produces the error.

Q: I still get an error, but the page is definitely ok.

A: Some servers deny access to automated tools (also called robots) like LinkChecker. This is not a bug in LinkChecker but rather a policy of the webmaster running the website you are checking. Look at the /robots.txt file, which follows the robots.txt exclusion standard.

Q: How can I tell LinkChecker which proxy to use?

A: LinkChecker works transparently with proxies. In a Unix or Windows environment, set the http_proxy, https_proxy or ftp_proxy environment variables to a URL that identifies the proxy server before starting LinkChecker. For example:

$ http_proxy="http://www.someproxy.com:3128"
$ export http_proxy

Q: The link "mailto:john@company.com?subject=Hello John" is reported as an error.

A: You have to quote special characters (e.g. spaces) in the subject field. The correct link should be "mailto:...?subject=Hello%20John". Unfortunately, browsers like IE and Netscape do not enforce this.

Q: Does LinkChecker have JavaScript support?

A: No, and it never will. If your page does not work without JS, it is better checked with a browser testing tool like Selenium.

Q: Is LinkChecker's cookie feature insecure?

A: If a cookie file is specified, the information will be sent to the specified hosts. The following restrictions apply to LinkChecker cookies:

  • Cookies will only be sent to the originating server.
  • Cookies are only stored in memory. After LinkChecker finishes, they are lost.
  • The cookie feature is disabled by default.

Q: I see LinkChecker gets a /robots.txt file for every site it checks. What is that about?

A: LinkChecker follows the robots.txt exclusion standard. To avoid misuse of LinkChecker, you cannot turn this feature off. See the Web Robot pages and the Spidering report for more info.

Q: How do I print unreachable/dead documents of my website with LinkChecker?

A: No can do. This would require file system access to your web repository and access to your web server configuration.

Q: How do I check HTML/XML/CSS syntax with LinkChecker?

A: Use the --check-html and --check-css options.
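The four mailto: check steps described above (address syntax, MX lookup, SMTP connect, VRFY) can be sketched as follows. This is an illustrative sketch, not LinkChecker's code: the function names are invented, the MX lookup assumes the third-party dnspython package, and the network-touching steps are only defined, not executed.

```python
import re
import smtplib

def check_syntax(address):
    """Step 1: check the parts before and after the @ sign."""
    local, sep, domain = address.partition("@")
    if not sep or not local or not domain:
        return False
    return (re.fullmatch(r"[A-Za-z0-9.!#$%&'*+/=?^_`{|}~-]+", local) is not None
            and re.fullmatch(r"[A-Za-z0-9.-]+", domain) is not None)

def mx_hosts(domain):
    """Step 2: look up MX records, highest priority (lowest preference) first."""
    import dns.resolver  # third-party: dnspython
    answers = dns.resolver.resolve(domain, "MX")
    return [r.exchange.to_text()
            for r in sorted(answers, key=lambda r: r.preference)]

def verify(address, host):
    """Steps 3 and 4: open an SMTP connection and try the VRFY command."""
    with smtplib.SMTP(host, timeout=10) as smtp:
        return smtp.verify(address)  # (code, reply) from the server

print(check_syntax("john@company.com"))  # True
print(check_syntax("no-at-sign"))        # False
```

Note that many mail servers disable VRFY, so step 4 yielding no answer is the common case rather than the exception.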
diff --git a/doc/html/index.txt b/doc/html/index.txt
new file mode 100644
index 00000000..514825db
--- /dev/null
+++ b/doc/html/index.txt
@@ -0,0 +1,146 @@
+Documentation
+=============
+
+Basic usage
+-----------
+
+To check a URL like ``http://www.myhomepage.org/`` it is enough to
+execute ``linkchecker http://www.myhomepage.org/``. This will check the
+complete domain of www.myhomepage.org recursively. All links pointing
+outside of the domain are also checked for validity.
+
+Performed checks
+----------------
+
+All URLs have to pass a preliminary syntax test. Minor quoting
+mistakes will issue a warning; all other invalid syntax issues
+are errors.
+After the syntax check passes, the URL is queued for connection
+checking. All connection check types are described below.
+
+- HTTP links (``http:``, ``https:``)
+
+  After connecting to the given HTTP server the given path
+  or query is requested. All redirections are followed, and
+  if user/password is given it will be used as authorization
+  when necessary.
+  Permanently moved pages issue a warning.
+  All final HTTP status codes other than 2xx are errors.
+
+- Local files (``file:``)
+
+  A regular, readable file that can be opened is valid. A readable
+  directory is also valid. All other files, for example device files
+  or unreadable and non-existing files, are errors.
+
+  File contents are checked for recursion.
+
+- Mail links (``mailto:``)
+
+  A mailto: link eventually resolves to a list of email addresses.
+  If one address fails, the whole list will fail.
+  For each mail address we check the following things:
+
+  1) Check the address syntax, both of the part before and after
+     the @ sign.
+  2) Look up the MX DNS records. If we find no MX record,
+     print an error.
+  3) Check if one of the mail hosts accepts an SMTP connection.
+     Check hosts with higher priority first.
+     If no host accepts SMTP, we print a warning.
+  4) Try to verify the address with the VRFY command. If we get
+     an answer, print the verified address as an info.
+
+- FTP links (``ftp:``)
+
+  For FTP links we do:
+
+  1) connect to the specified host
+  2) try to log in with the given user and password. The default
+     user is ``anonymous``, the default password is ``anonymous@``.
+  3) try to change to the given directory
+  4) list the file with the NLST command
+
+- Telnet links (``telnet:``)
+
+  We try to connect and, if user/password are given, log in to the
+  given telnet server.
+
+- NNTP links (``news:``, ``snews:``, ``nntp``)
+
+  We try to connect to the given NNTP server. If a news group or
+  article is specified, try to request it from the server.
+
+- Ignored links (``javascript:``, etc.)
+
+  An ignored link will only print a warning. No further checking
+  will be made.
+
+  Here is a complete list of recognized, but ignored links. The most
+  prominent of them are JavaScript links.
+
+  - ``acap:`` (application configuration access protocol)
+  - ``afs:`` (Andrew File System global file names)
+  - ``chrome:`` (Mozilla specific)
+  - ``cid:`` (content identifier)
+  - ``clsid:`` (Microsoft specific)
+  - ``data:`` (data)
+  - ``dav:`` (dav)
+  - ``fax:`` (fax)
+  - ``find:`` (Mozilla specific)
+  - ``gopher:`` (Gopher)
+  - ``imap:`` (internet message access protocol)
+  - ``isbn:`` (ISBN (int. book numbers))
+  - ``javascript:`` (JavaScript)
+  - ``ldap:`` (Lightweight Directory Access Protocol)
+  - ``mailserver:`` (Access to data available from mail servers)
+  - ``mid:`` (message identifier)
+  - ``mms:`` (multimedia stream)
+  - ``modem:`` (modem)
+  - ``nfs:`` (network file system protocol)
+  - ``opaquelocktoken:`` (opaquelocktoken)
+  - ``pop:`` (Post Office Protocol v3)
+  - ``prospero:`` (Prospero Directory Service)
+  - ``rsync:`` (rsync protocol)
+  - ``rtsp:`` (real time streaming protocol)
+  - ``service:`` (service location)
+  - ``shttp:`` (secure HTTP)
+  - ``sip:`` (session initiation protocol)
+  - ``tel:`` (telephone)
+  - ``tip:`` (Transaction Internet Protocol)
+  - ``tn3270:`` (Interactive 3270 emulation sessions)
+  - ``vemmi:`` (versatile multimedia interface)
+  - ``wais:`` (Wide Area Information Servers)
+  - ``z39.50r:`` (Z39.50 Retrieval)
+  - ``z39.50s:`` (Z39.50 Session)
+
+
+Recursion
+---------
+
+Before descending recursively into a URL, it has to fulfill several
+conditions. They are checked in this order:
+
+1. A URL must be valid.
+
+2. A URL must be parseable. This currently includes HTML files,
+   Opera bookmarks files, and directories. If a file type cannot
+   be determined (for example it does not have a common HTML file
+   extension, and the content does not look like HTML), it is assumed
+   to be non-parseable.
+
+3. The URL content must be retrievable. This is usually the case,
+   except for mailto: or unknown URL types, for example.
+
+4. The maximum recursion level must not be exceeded. It is configured
+   with the ``--recursion-level`` option and is unlimited by default.
+
+5. It must not match the ignored URL list. This is controlled with
+   the ``--ignore-url`` option.
+
+6. The Robots Exclusion Protocol must allow links in the URL to be
+   followed recursively. This is checked by searching for a
+   "nofollow" directive in the HTML header data.
+
+Note that the directory recursion reads all files in that
+directory, not just a subset like ``index.htm*``.
diff --git a/doc/html/logo64x64.png b/doc/html/logo64x64.png
old mode 100755
new mode 100644
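Recursion condition 6 in index.txt, searching the HTML header data for a "nofollow" directive, can be sketched with Python's standard-library HTMLParser. The class and function names here are illustrative, not LinkChecker's implementation:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Scan for <meta name="robots" content="... nofollow ...">."""

    def __init__(self):
        super().__init__()
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        name = (d.get("name") or "").lower()
        content = (d.get("content") or "").lower()
        if name == "robots" and "nofollow" in content:
            self.nofollow = True

def may_follow(html_text):
    """Return True if recursion into links of this page is allowed."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return not parser.nofollow

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(may_follow(page))                          # False
print(may_follow("<html><head></head></html>"))  # True
```

HTMLParser lowercases tag and attribute names itself, so the explicit `.lower()` calls are only needed for the attribute values.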