:github_url: https://github.com/linkchecker/linkchecker/blob/master/doc/src/man/linkchecker.rst

linkchecker
===========

SYNOPSIS
--------

**linkchecker** [*options*] [*file-or-url*]...

DESCRIPTION
-----------

LinkChecker features:

-  recursive and multithreaded checking
-  output in colored or normal text, HTML, SQL, CSV, XML or a sitemap
   graph in different formats
-  support for HTTP/1.1, HTTPS, FTP, mailto:, news:, nntp:, Telnet and
   local file links
-  restriction of link checking with URL filters
-  proxy support
-  username/password authorization for HTTP, FTP and Telnet
-  support for robots.txt exclusion protocol
-  support for Cookies
-  support for HTML5
-  Antivirus check
-  a command line and web interface

EXAMPLES
--------

The most common use checks the given domain recursively:

.. code-block:: console

   $ linkchecker http://www.example.com/

Beware that this checks the whole site, which can have thousands of
URLs. Use the :option:`-r` option to restrict the recursion depth.

Don't check URLs with **/secret** in their name. All other links are
checked as usual:

.. code-block:: console

   $ linkchecker --ignore-url=/secret mysite.example.com

Checking a local HTML file on Unix:

.. code-block:: console

   $ linkchecker ../bla.html

Checking a local HTML file on Windows:

.. code-block:: doscon

   C:\> linkchecker c:\temp\test.html

You can skip the **http://** URL part if the domain starts with
**www.**:

.. code-block:: console

   $ linkchecker www.example.com

You can skip the **ftp://** URL part if the domain starts with **ftp.**:

.. code-block:: console

   $ linkchecker -r0 ftp.example.com

Generate a sitemap graph and convert it with the graphviz dot utility:

.. code-block:: console

   $ linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps

OPTIONS
-------

General options
^^^^^^^^^^^^^^^

.. option:: -f FILENAME, --config=FILENAME

    Use FILENAME as configuration file. By default LinkChecker uses
    ~/.linkchecker/linkcheckerrc.

.. option:: -h, --help

    Help me! Print usage information for this program.

.. option:: --stdin

    Read a list of whitespace-separated URLs to check from stdin.

.. option:: -t NUMBER, --threads=NUMBER

    Generate no more than the given number of threads. Default number of
    threads is 10. To disable threading specify a non-positive number.

.. option:: -V, --version

    Print version and exit.

.. option:: --list-plugins

    Print available check plugins and exit.

Output options
^^^^^^^^^^^^^^

URL checking results
""""""""""""""""""""

.. option:: -F TYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]

    Output to a file linkchecker-out.TYPE,
    $HOME/.linkchecker/failures for the failures output type, or
    FILENAME if specified. The ENCODING specifies the output
    encoding, the default is that of your locale. Valid encodings are
    listed at
    https://docs.python.org/library/codecs.html#standard-encodings.
    For the none output type the FILENAME and ENCODING parts are
    ignored. For all other types an existing file will be overwritten.
    You can specify this option more than once. Valid file output TYPEs
    are text, html, sql, csv, gml, dot, xml,
    sitemap, none or failures. Default is no file output.
    The various output types are documented below. Note that you can
    suppress all console output with the option :option:`-o` *none*.
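
    As a rough illustration of how a TYPE[/ENCODING][/FILENAME] value
    breaks down, the following sketch splits it into its three parts.
    It is not LinkChecker's own parser: ``parse_file_output`` is a
    hypothetical helper, and telling ENCODING apart from FILENAME by
    asking the Python codec registry is an assumption made for this
    example.

    .. code-block:: python

       import codecs


       def parse_file_output(spec):
           """Split TYPE[/ENCODING][/FILENAME] into (type, encoding, filename)."""
           output_type, _, rest = spec.partition("/")
           if output_type == "none":
               # ENCODING and FILENAME are ignored for the none output type.
               return output_type, None, None
           encoding = None
           filename = None
           if rest:
               candidate, _, tail = rest.partition("/")
               try:
                   codecs.lookup(candidate)
               except LookupError:
                   # Assumption: not a known codec name, so it is the FILENAME.
                   filename = rest
               else:
                   encoding = candidate
                   filename = tail or None
           if filename is None:
               # Documented defaults when no FILENAME is given.
               if output_type == "failures":
                   filename = "$HOME/.linkchecker/failures"
               else:
                   filename = "linkchecker-out." + output_type
           return output_type, encoding, filename

    Under these assumptions ``csv/utf-8/report.csv`` yields the type csv,
    the encoding utf-8 and the file report.csv, while a plain ``html``
    falls back to linkchecker-out.html. A file that happens to be named
    like a codec would be misread as an encoding by this sketch.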

.. option:: --no-warnings

    Don't log warnings. Default is to log warnings.

.. option:: -o TYPE[/ENCODING], --output=TYPE[/ENCODING]

    Specify the console output type as text, html, sql, csv,
    gml, dot, xml, sitemap, none or failures.
    Default type is text. The various output types are documented below.
    The ENCODING specifies the output encoding, the default is that of
    your locale. Valid encodings are listed at
    https://docs.python.org/library/codecs.html#standard-encodings.

.. option:: -v, --verbose

    Log all checked URLs. Default is to log only errors and warnings.

Progress updates
""""""""""""""""

.. option:: --no-status

    Do not print URL check status messages.

Application
"""""""""""

.. option:: -D STRING, --debug=STRING

    Print debugging output for the given logger. Available loggers are
    cmdline, checking, cache, dns, plugin and
    all. Specifying all is an alias for specifying all available
    loggers. The option can be given multiple times to debug with more
    than one logger. For accurate results, threading will be disabled
    during debug runs.

Quiet
"""""

.. option:: -q, --quiet

    Quiet operation, an alias for :option:`-o` *none* that also hides
    application information messages.
    This is only useful with :option:`-F`, else no results will be output.

Checking options
^^^^^^^^^^^^^^^^

.. option:: --cookiefile=FILENAME

    Read a file with initial cookie data. The cookie data format is
    explained below.

.. option:: --check-extern

    Check external URLs.

.. option:: --ignore-url=REGEX

    URLs matching the given regular expression will only be syntax checked.
    This option can be given multiple times.
    See section `REGULAR EXPRESSIONS`_ for more info.

.. option:: -N STRING, --nntp-server=STRING

    Specify an NNTP server for news: links. Default is the
    environment variable :envvar:`NNTP_SERVER`. If no host is given, only the
    syntax of the link is checked.

.. option:: --no-follow-url=REGEX

    Check but do not recurse into URLs matching the given regular
    expression.
    This option can be given multiple times.
    See section `REGULAR EXPRESSIONS`_ for more info.

.. option:: --no-robots

    Check URLs regardless of any robots.txt files.

.. option:: -p, --password

    Read a password from console and use it for HTTP and FTP
    authorization. For FTP the default password is anonymous@. For
    HTTP there is no default password. See also :option:`-u`.

.. option:: -r NUMBER, --recursion-level=NUMBER

    Recursively check all links up to the given depth. A negative depth
    will enable infinite recursion. Default depth is infinite.

.. option:: --timeout=NUMBER

    Set the timeout for connection attempts in seconds. The default
    timeout is 60 seconds.

.. option:: -u STRING, --user=STRING

    Try the given username for HTTP and FTP authorization. For FTP the
    default username is anonymous. For HTTP there is no default
    username. See also :option:`-p`.

.. option:: --user-agent=STRING

    Specify the User-Agent string to send to the HTTP server, for
    example "Mozilla/4.0". The default is "LinkChecker/X.Y" where X.Y is
    the current version of LinkChecker.

CONFIGURATION FILES
-------------------

Configuration files can specify all options above. They can also specify
some options that cannot be set on the command line. See
:manpage:`linkcheckerrc(5)` for more info.

OUTPUT TYPES
------------

Note that by default only errors and warnings are logged. You should use
the option :option:`--verbose` to get the complete URL list, especially when
outputting a sitemap graph format.

**text**
    Standard text logger, logging URLs in keyword: argument fashion.
**html**
    Log URLs in keyword: argument fashion, formatted as HTML.
    Additionally has links to the referenced pages. Invalid URLs have
    HTML and CSS syntax check links appended.
**csv**
    Log check result in CSV format with one URL per line.
**gml**
    Log parent-child relations between linked URLs as a GML sitemap
    graph.
**dot**
    Log parent-child relations between linked URLs as a DOT sitemap
    graph.
**gxml**
    Log check result as a GraphXML sitemap graph.
**xml**
    Log check result as machine-readable XML.
**sitemap**
    Log check result as an XML sitemap whose protocol is documented at
    https://www.sitemaps.org/protocol.html.
**sql**
    Log check result as SQL script with INSERT commands. An example
    script to create the initial SQL table is included as create.sql.
**failures**
    Suitable for cron jobs. Logs the check result into a file
    **~/.linkchecker/failures** which only contains entries with
    invalid URLs and the number of times they have failed.
**none**
    Logs nothing. Suitable for debugging or checking the exit code.

REGULAR EXPRESSIONS
-------------------

LinkChecker accepts Python regular expressions. See
https://docs.python.org/howto/regex.html for an introduction.
An addition is that a leading exclamation mark negates the regular
expression.
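
The negation rule can be illustrated with a minimal sketch. It is not
LinkChecker's own code: ``url_matches`` is a hypothetical helper, and it
assumes that a pattern may match anywhere in the URL (search semantics),
which is how the **/secret** example in the EXAMPLES section reads.

.. code-block:: python

   import re


   def url_matches(pattern, url):
       """Return True if url is selected by pattern, honoring a leading '!'."""
       negate = pattern.startswith("!")
       if negate:
           # Strip the exclamation mark; the rest is a normal Python regex.
           pattern = pattern[1:]
       found = re.search(pattern, url) is not None
       # A negated pattern selects exactly the URLs the plain pattern misses.
       return found != negate

In this sketch the pattern ``!/secret`` selects every URL that does not
contain /secret.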

COOKIE FILES
------------

A cookie file contains standard HTTP header (RFC 2616) data with the
following possible names:

**Host** (required)
    Sets the domain the cookies are valid for.
**Path** (optional)
    Gives the path the cookies are valid for; default path is **/**.
**Set-cookie** (required)
    Set cookie name/value. Can be given more than once.

Multiple entries are separated by a blank line. The example below will
send two cookies to all URLs starting with **http://example.com/hello/**
and one to all URLs starting with **https://example.org/**:

::

      Host: example.com
      Path: /hello
      Set-cookie: ID="smee"
      Set-cookie: spam="egg"

::

      Host: example.org
      Set-cookie: baggage="elitist"; comment="hologram"
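
To make the file layout concrete, the sketch below reads such a file
into a list of entries. It only illustrates the format described above
and is not LinkChecker's own loader; ``parse_cookie_file`` and
``EXAMPLE`` are hypothetical names, and using the standard ``email``
module to read the header lines is an assumption of this example.

.. code-block:: python

   import email
   import re


   def parse_cookie_file(text):
       """Return one dict per blank-line separated entry of a cookie file."""
       entries = []
       for block in re.split(r"\n\s*\n", text.strip()):
           headers = email.message_from_string(block)
           host = headers.get("Host")
           cookies = headers.get_all("Set-cookie") or []
           if not host or not cookies:
               raise ValueError("Host and Set-cookie are required")
           entries.append({
               "host": host,
               "path": headers.get("Path", "/"),  # default path is /
               "cookies": cookies,
           })
       return entries


   EXAMPLE = """\
   Host: example.com
   Path: /hello
   Set-cookie: ID="smee"
   Set-cookie: spam="egg"

   Host: example.org
   Set-cookie: baggage="elitist"; comment="hologram"
   """

Calling ``parse_cookie_file(EXAMPLE)`` gives two entries: the first with
path /hello and two cookies, the second with the default path / and one
cookie.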


PROXY SUPPORT
-------------

To use a proxy on Unix or Windows set the :envvar:`http_proxy` or
:envvar:`https_proxy` environment variables to the proxy URL. The URL should be
of the form
**http://**\ [*user*\ **:**\ *pass*\ **@**]\ *host*\ [**:**\ *port*].
LinkChecker also detects manual proxy settings of Internet Explorer
under Windows systems, and GNOME or KDE on Linux systems. On a Mac use
the Internet Config to select a proxy.
You can also set a comma-separated domain list in the :envvar:`no_proxy`
environment variable to ignore any proxy settings for these domains.
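
The parts of the proxy URL and the idea behind :envvar:`no_proxy` can be
sketched as follows. Both helpers are hypothetical and only illustrate
the description above. In particular, the suffix matching used in
``bypasses_proxy`` is an assumption; the exact matching rules are not
specified in this manual page.

.. code-block:: python

   from urllib.parse import urlsplit


   def describe_proxy(proxy_url):
       """Break a proxy URL into host, port, user and password."""
       parts = urlsplit(proxy_url)
       return {
           "host": parts.hostname,
           "port": parts.port,
           "user": parts.username,
           "password": parts.password,
       }


   def bypasses_proxy(hostname, no_proxy):
       """Return True if hostname falls under a domain listed in no_proxy."""
       for domain in no_proxy.split(","):
           domain = domain.strip().lstrip(".")
           # Assumption: a listed domain also covers its subdomains.
           if domain and (hostname == domain or hostname.endswith("." + domain)):
               return True
       return False

For a proxy URL without credentials, the user and password entries are
``None``.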

Setting an HTTP proxy on Unix looks like this, for example:

.. code-block:: console

   $ export http_proxy="http://proxy.example.com:8080"

Proxy authentication is also supported:

.. code-block:: console

   $ export http_proxy="http://user1:mypass@proxy.example.org:8081"

Setting a proxy on the Windows command prompt:

.. code-block:: doscon

   C:\> set http_proxy=http://proxy.example.com:8080

PERFORMED CHECKS
----------------

All URLs have to pass a preliminary syntax test. Minor quoting mistakes
issue a warning; all other syntax problems are errors. After
the syntax check passes, the URL is queued for connection checking. All
connection check types are described below.

HTTP links (**http:**, **https:**)
    After connecting to the given HTTP server the given path or query is
    requested. All redirections are followed, and if user/password is
    given it will be used as authorization when necessary. All final
    HTTP status codes other than 2xx are errors.

    HTML page contents are checked for recursion.

Local files (**file:**)
    A regular, readable file that can be opened is valid. A readable
    directory is also valid. All other files, for example device files,
    unreadable or non-existing files are errors.

    HTML or other parseable file contents are checked for recursion.

Mail links (**mailto:**)
    A mailto: link eventually resolves to a list of email addresses.
    If one address fails, the whole list will fail. For each mail
    address we check the following things:

    1. Check the address syntax, both the parts before and after the
       @ sign.
    2. Look up the MX DNS records. If no MX record is found, print an
       error.
    3. Check if one of the mail hosts accepts an SMTP connection. Check
       hosts with higher priority first. If no host accepts SMTP, we
       print a warning.
    4. Try to verify the address with the VRFY command. If we get an
       answer, print the verified address as an info.

FTP links (**ftp:**)
    For FTP links we do:

    1. connect to the specified host
    2. try to log in with the given user and password. The default user
       is **anonymous**, the default password is **anonymous@**.
    3. try to change to the given directory
    4. list the file with the NLST command
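
    The four steps above can be sketched with the standard ``ftplib``
    module. This is an illustration only, not LinkChecker's own
    implementation: ``check_ftp_link`` and its ``ftp_factory`` parameter
    are hypothetical, and listing the directory and looking for the file
    name in the result is one assumed way to use NLST.

    .. code-block:: python

       import ftplib


       def check_ftp_link(host, directory, filename, user="anonymous",
                          password="anonymous@", ftp_factory=ftplib.FTP):
           """Return True if filename is listed in directory on host."""
           ftp = ftp_factory()
           try:
               ftp.connect(host)            # 1. connect to the host
               ftp.login(user, password)    # 2. log in (anonymous by default)
               if directory:
                   ftp.cwd(directory)       # 3. change to the directory
               names = ftp.nlst()           # 4. list names with NLST
               return filename in names
           finally:
               ftp.close()

    Any ``ftplib`` error raised along the way propagates to the caller,
    which a real checker would turn into an error result for the link.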

Telnet links (**telnet:**)
    We try to connect and, if user/password are given, log in to the given
    telnet server.

NNTP links (**news:**, **snews:**, **nntp:**)
    We try to connect to the given NNTP server. If a news group or
    article is specified, try to request it from the server.

Unsupported links (**javascript:**, etc.)
    An unsupported link will only print a warning. No further checking
    will be made.

    The complete list of recognized, but unsupported links can be found
    in the
    `linkcheck/checker/unknownurl.py <https://github.com/linkchecker/linkchecker/blob/master/linkcheck/checker/unknownurl.py>`__
    source file. The most prominent of them should be JavaScript links.

PLUGINS
-------

There are two plugin types: connection and content plugins. Connection
plugins are run after a successful connection to the URL host. Content
plugins are run if the URL type has content (mailto: URLs have no
content for example) and if the check is not forbidden (i.e. by HTTP
robots.txt).
Use the option :option:`--list-plugins` for a list of plugins and their
documentation. All plugins are enabled via the :manpage:`linkcheckerrc(5)`
configuration file.

RECURSION
---------

Before descending recursively into a URL, it has to fulfill several
conditions. They are checked in this order:

1. A URL must be valid.
2. A URL must be parseable. This currently includes HTML files, Opera
   bookmarks files, and directories. If a file type cannot be determined
   (for example it does not have a common HTML file extension, and the
   content does not look like HTML), it is assumed to be non-parseable.
3. The URL content must be retrievable. This is usually the case, except
   for mailto: or unknown URL types, for example.
4. The maximum recursion level must not be exceeded. It is configured
   with the :option:`--recursion-level` option and is unlimited by default.
5. It must not match the ignored URL list. This is controlled with the
   :option:`--ignore-url` option.
6. The Robots Exclusion Protocol must allow links in the URL to be
   followed recursively. This is checked by searching for a "nofollow"
   directive in the HTML header data.

Note that the directory recursion reads all files in that directory, not
just a subset like **index.htm**.
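
The ordered conditions above can be written down as a small decision
function. The sketch is hypothetical: ``UrlState`` and ``may_recurse``
are names invented for this example, the fields stand in for the results
of the individual checks, and the comparison of the current depth against
the configured level is an assumption based on the ``-r0`` example, which
checks a URL without descending into it.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class UrlState:
       valid: bool        # 1. URL is valid
       parseable: bool    # 2. content type can be parsed for links
       retrievable: bool  # 3. content can be retrieved
       depth: int         # recursion depth of this URL (start URLs are 0)
       ignored: bool      # 5. URL matches the ignored URL list
       nofollow: bool     # 6. a "nofollow" directive was found


   def may_recurse(state, recursion_level=-1):
       """Apply the conditions in the documented order."""
       if not state.valid:
           return False
       if not state.parseable:
           return False
       if not state.retrievable:
           return False
       # 4. A negative level means unlimited recursion.
       if recursion_level >= 0 and state.depth >= recursion_level:
           return False
       if state.ignored:
           return False
       if state.nofollow:
           return False
       return True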

NOTES
-----

URLs on the command line starting with **ftp.** are treated like
**ftp://ftp.**, URLs starting with **www.** are treated like
**http://www.**. You can also give local files as arguments.
If you have your system configured to automatically establish a
connection to the internet (e.g. with diald), it will connect when
checking links not pointing to your local host. Use the :option:`--ignore-url`
option to prevent this.
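
The prefix rule for **ftp.** and **www.** arguments amounts to the
following. ``normalize_argument`` is a hypothetical helper written for
this illustration; it is not taken from the LinkChecker sources.

.. code-block:: python

   def normalize_argument(arg):
       """Add the scheme implied by an ftp. or www. prefix."""
       if arg.startswith("ftp."):
           return "ftp://" + arg
       if arg.startswith("www."):
           return "http://" + arg
       # Anything else (full URLs, local files) is left untouched here.
       return arg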

JavaScript links are not supported.

If your platform does not support threading, LinkChecker disables it
automatically.

You can supply multiple user/password pairs in a configuration file.

When checking **news:** links the given NNTP host doesn't need to be the
same as the host of the user browsing your pages.

ENVIRONMENT
-----------

.. envvar:: NNTP_SERVER

   specifies default NNTP server

.. envvar:: http_proxy

   specifies default HTTP proxy server

.. envvar:: ftp_proxy

   specifies default FTP proxy server

.. envvar:: no_proxy

   comma-separated list of domains to not contact over a proxy server

.. envvar:: LC_MESSAGES, LANG, LANGUAGE

   specify output language

RETURN VALUE
------------

The return value is 2 when

-  a program error occurred.

The return value is 1 when

-  invalid links were found or
-  link warnings were found and warnings are enabled

Else the return value is zero.
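
A script that wraps LinkChecker can branch on these values. The sketch
below is an assumption-laden example rather than part of LinkChecker:
``describe_exit_code`` and ``run_linkchecker`` are hypothetical helpers,
and ``run_linkchecker`` expects a **linkchecker** executable on the
search path.

.. code-block:: python

   import subprocess

   # Meanings taken from the list above.
   EXIT_MEANINGS = {
       0: "no invalid links found",
       1: "invalid links or (enabled) link warnings found",
       2: "program error",
   }


   def describe_exit_code(code):
       """Translate a LinkChecker return value into a short message."""
       return EXIT_MEANINGS.get(code, "unexpected return value")


   def run_linkchecker(url):
       """Run linkchecker on url and return its exit code."""
       completed = subprocess.run(["linkchecker", "--no-status", url])
       return completed.returncode

A cron job could, for instance, send a notification only when
``run_linkchecker`` returns a non-zero value.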

LIMITATIONS
-----------

LinkChecker consumes memory for each queued URL to check. With thousands
of queued URLs the amount of consumed memory can become quite large.
This might slow down the program or even the whole system.

FILES
-----

**~/.linkchecker/linkcheckerrc** - default configuration file

**~/.linkchecker/failures** - default failures logger output filename

**linkchecker-out.**\ *TYPE* - default logger file output name

SEE ALSO
--------

:manpage:`linkcheckerrc(5)`

https://docs.python.org/library/codecs.html#standard-encodings - valid
output encodings

https://docs.python.org/howto/regex.html - regular expression
documentation