linkchecker/doc/src/faq.rst
2023-08-28 19:32:28 +01:00

131 lines
4.4 KiB
ReStructuredText

:github_url: https://github.com/linkchecker/linkchecker/blob/master/doc/src/faq.rst
Frequently Asked Questions
==========================
**Q: LinkChecker produced an error, but my web page is okay with
Mozilla/IE/Opera/... Is this a bug in LinkChecker?**
A: Please check your web pages first. Are they really okay?
Often the major browsers are very forgiving and good at handling HTML
or HTTP errors, while LinkChecker complains in most cases of invalid
content.
Enable the :ref:`man/linkcheckerrc:HtmlSyntaxCheck` plugin,
or check if you are using a proxy which produces the error.
**Q: I still get an error, but the page is definitely okay.**
A: Some servers deny access to automated tools (also called robots)
like LinkChecker. This is not a bug in LinkChecker but rather a
policy by the webmaster running the website you are checking. Look in
the ``/robots.txt`` file which follows the
`robots.txt exclusion standard <http://www.robotstxt.org/robotstxt.html>`_.
For identification LinkChecker adds to each request a User-Agent header
like this::
Mozilla/5.0 (compatible; LinkChecker/9.4; +https://linkchecker.github.io/linkchecker/)
If you yourself are the webmaster, consider allowing LinkChecker to
check your web pages by adding the following to your robots.txt file::
User-Agent: LinkChecker
Allow: /
**Q: How can I tell LinkChecker which proxy to use?**
A: LinkChecker works automatically with proxies. In a Unix or Windows
environment, set the http_proxy or https_proxy environment
variables to a URL that identifies the proxy server before starting
LinkChecker. For example:
.. code-block:: console
$ http_proxy="http://www.example.com:3128"
$ export http_proxy
**Q: The link "mailto:john@company.com?subject=Hello John" is reported
as an error.**
A: You have to quote special characters (e.g. spaces) in the subject field.
The correct link should be "mailto:...?subject=Hello%20John"
Unfortunately browsers like IE and Netscape do not enforce this.
**Q: Has LinkChecker JavaScript support?**
A: No, it never will. If your page is only working with JS, it is
better to use a browser testing tool like `Selenium <http://seleniumhq.org/>`_.
**Q: Is the LinkCheckers cookie feature insecure?**
A: Potentially yes. This depends on what information you specify in the
cookie file. The cookie information will be sent to the specified
hosts.
Also, the following restrictions apply for cookies that LinkChecker
receives from the hosts it check:
- Cookies will only be sent back to the originating server (i.e. no
third party cookies are allowed).
- Cookies are only stored in memory. After LinkChecker finishes, they
are lost.
- The cookie feature is disabled as default.
**Q: LinkChecker retrieves a /robots.txt file for every site it
checks. What is that about?**
A: LinkChecker follows the
`robots.txt exclusion standard <http://www.robotstxt.org/robotstxt.html>`_.
To avoid misuse of LinkChecker, you cannot turn this feature off.
See the `Web Robot pages <http://www.robotstxt.org/robotstxt.html>`_ and the
`Spidering report <http://www.w3.org/Search/9605-Indexing-Workshop/ReportOutcomes/Spidering.txt>`_
for more info.
If you yourself are the webmaster, consider allowing LinkChecker to
check your web pages by adding the following to your robots.txt file::
User-Agent: LinkChecker
Allow: /
**Q: How do I print unreachable/dead documents of my website with
LinkChecker?**
A: No can do. This would require file system access to your web
repository and access to your web server configuration.
**Q: How do I check HTML/XML/CSS syntax with LinkChecker?**
A: Enable the :ref:`man/linkcheckerrc:HtmlSyntaxCheck` and
:ref:`man/linkcheckerrc:CssSyntaxCheck` plugins.
**Q: I want to have my own logging class. How can I use it in LinkChecker?**
A: A Python API lets you define new logging classes.
Define your own logging class as a subclass of *_Logger* or any other
logging class in the *log* module.
Then call the *add_logger* function in *Config.Configuration* to register
your new Logger.
After this append a new Logging instance to the fileoutput.
.. code-block:: python
import linkcheck
class MyLogger(linkcheck.logger._Logger):
LoggerName = 'mylog'
LoggerArgs = {'fileoutput': log_format, 'filename': 'foo.txt'}
# ...
cfg = linkcheck.configuration.Configuration()
cfg.logger_add(MyLogger)
cfg['fileoutput'].append(cfg.logger_new(MyLogger.LoggerName))