The test unzips a zip file containing a weird-looking non-ASCII filename.
I don't think zip files specify the encoding for filenames. Different
unzip utilities may interpret the filename differently. Plus, the byte
representation of the unzipped filename may be different depending on
the filesystem charset.
To me it looks as if the filename is garbage that happens to be valid UTF-8,
and the test expects to get it back as latin-1 or something similar.
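For illustration, a minimal Python sketch of the suspected mismatch (the
filename bytes here are made up):

    raw = b'\xc3\xa4'             # hypothetical filename bytes stored in the zip entry
    print(raw.decode('utf-8'))    # 'ä'  -- what a UTF-8-aware unzip produces
    print(raw.decode('latin-1'))  # 'Ã¤' -- what the test seems to expect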
The HTML tag has two attributes with URLs:
<applet archive="file.html" src="file.css">
It would appear that the order in which these attributes are crawled
does not match the order in the result file.
Possibly the crawling order is non-deterministic, although I cannot
reproduce that. If that's the case, the fix would be to sort the
attributes in the crawler before following them, which means we want the
expected results sorted as well (and since 'archive' comes before 'src',
file.html should come before file.css).
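A minimal sketch of that idea, using hypothetical names rather than the
crawler's actual internals:

    def urls_from_tag(attrs):
        """Yield URL-bearing attribute values in a stable, sorted order."""
        url_attrs = ('archive', 'src')  # attributes that may contain URLs
        for name in sorted(url_attrs):
            if name in attrs:
                yield attrs[name]

    # For <applet archive="file.html" src="file.css"> this always yields
    # file.html before file.css, matching the sorted expected results.
    print(list(urls_from_tag({'src': 'file.css', 'archive': 'file.html'})))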
We must do this, because py.test adjusts sys.path to make
'tests.test_foo' importable [*]. When py.test does this, the
'linkcheck' directory at the top of the git tree is the one that gets
imported in the tests. If we've told pip to use develop mode, all's
fine. If we haven't, then we're going to get errors because extension
modules like _network.so get installed into
.tox/*/lib/*/site-packages/linkcheck/network and not into
./linkcheck/network/
[*] http://doc.pytest.org/en/latest/goodpractices.html#choosing-a-test-layout-import-rules
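A minimal sketch of one way to ask pip for develop mode under tox (assuming
the usual tox.ini layout; this may not be exactly what the change does):

    [testenv]
    # Develop-mode install ("pip install -e .") builds extension modules in
    # place, so the in-tree ./linkcheck package that py.test imports also
    # contains _network.so.
    usedevelop = true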
The one test failure in Travis happens in
TestConsole.test_internal_error, but only if you have the argcomplete
package installed.
This was a real bug in the error reporting code.
While this flag can be abused, it seems to me like a legitimate use case to
check a fairly small document for mistakes when it includes references to a
website whose robots.txt denies all robots. It turns out that most websites
do *not* add a permission for LinkCheck to use their site, and some sites,
the Debian BTS for example, are very hostile to bots in general.
There is not a big difference between me using linkcheck and using my web
browser to check those links one by one. In fact, using linkcheck may be
*better* for the website because it will use HEAD requests instead of GET,
and will not fetch all the page elements (javascript, images, etc.), which
can often be fairly big.
Besides, hostile users will patch the software themselves: it took me
only a few minutes to disable the check, and a few more to make that
into a proper patch.
By enforcing robots.txt without offering any other option, we are hurting
our good users and not keeping hostile users from doing harm.
The patch is still incomplete, but it works. It lacks documentation and
unit tests.
Closes: #508
- Add the warningregex parameter to the RegexCheck section.
- Add a note that the REGEX shouldn't be quoted (illustrated below).
- Change the quote style to double quotes to match the rest of the document.
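For illustration, here is roughly what the documented option might look like
in linkcheckerrc (the pattern itself is made up):

    [RegexCheck]
    # Do not quote the value; the quotes would become part of the regular
    # expression.
    warningregex=Oracle DB Error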