We want to allow specifying, for each URL, which warnings to ignore.
If no regex is specified for the warning to ignore, we'll ignore all
warnings for that URL.
The tests still pass as they are, which means that unknown
values in the configuration file are simply ignored.
* [#782] Add values to configuration file
* [#782] Parse new configuration values
* [#782] Actually ignore a warning
* [#782] Confirm side cases work as expected
* [#782] Add logging when deciding to ignore warnings
* [#782] Documentation for ignorewarningsforurls
* [#782] Update (generated) man pages
* [#782] These tests pass without network, actually
* [#782] Fix copy/paste error in symbol naming
* [#782] The regex matches the name of the warning, not the message
* [#782] Better wording
* [#782] Update (generated) man pages
* [#782] We match the type, not the message
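A minimal sketch of the matching behaviour described above, not the actual implementation. The rule list, the function name and the example warning names are hypothetical; the only behaviour taken from the commits is that the warning regex is matched against the warning's name (not its message) and that a missing warning regex ignores every warning for matching URLs.

```python
import re

# Hypothetical in-memory form of the parsed configuration:
# each rule is (URL regex, warning-name regex or None).
IGNORE_WARNINGS_FOR_URLS = [
    (re.compile(r"^https://example\.com/legacy/"), re.compile(r"^http-redirected$")),
    (re.compile(r"^https://example\.org/"), None),  # no warning regex: ignore all
]


def should_ignore_warning(url, warning_name, rules=IGNORE_WARNINGS_FOR_URLS):
    """Return True if a warning with this name should be ignored for this URL."""
    for url_re, warning_re in rules:
        if not url_re.search(url):
            continue
        # The regex is matched against the warning's name, not its message.
        if warning_re is None or warning_re.search(warning_name):
            return True
    return False
```

With these illustrative rules, only the named warning is suppressed under example.com/legacy/, while every warning is suppressed for example.org.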
I started with a test of urlencoded anchors, assuming that the URL might
have a urlencoded anchor, but the actual anchor in the HTML would NOT be
urlencoded.
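The assumption above can be sketched as follows. This is an illustration only: the function name and the set of anchors are hypothetical, and the real anchor check may differ. The fragment from the URL is percent-decoded before it is compared with the anchor names found in the HTML.

```python
from urllib.parse import unquote, urlsplit


def anchor_is_present(url, html_anchors):
    """Check a URL's fragment against anchor names collected from the HTML."""
    fragment = urlsplit(url).fragment
    if not fragment:
        # No anchor requested, so there is nothing to look up.
        return True
    # The URL may carry a urlencoded anchor; the HTML anchor is not encoded.
    return unquote(fragment) in html_anchors
```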
The minimum version supported was already 4.8.0 because of the use
of multi_valued_attributes [1].
Test support for < 4.8.1 is the only code that needs removing [2].
[1] 3ff3d724 ("Use BeautifulSoup element attrs directly", 2020-04-03)
[2] 607328d5 ("Support Beautiful Soup line numbers", 2019-10-05)
When BeautifulSoup finds an empty file on disk, it sets
original_encoding to None. It doesn't matter what encoding we pick for
empty files, so let's just pick one.
I don't know if there are any circumstances where BeautifulSoup might
set the encoding to None for a non-empty file.
Closes #392.
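A minimal sketch of the fallback, assuming the detected encoding is read from BeautifulSoup's original_encoding attribute. The helper name and the "utf-8" default are assumptions for illustration; the commit only says that some encoding is picked.

```python
def effective_encoding(original_encoding, default="utf-8"):
    """Return the detected encoding, or a fixed default when none was detected.

    original_encoding is expected to be the value of soup.original_encoding,
    which is None for an empty file. The default is arbitrary because an
    empty file decodes to the same empty string in any encoding.
    """
    return original_encoding if original_encoding is not None else default
```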
norobots.html was used for testing <meta name="robots"
content="nofollow"> in local files until [1]. This commit reinstates
local file testing and adds an http test.
The result of the check is reported by
checker.httpurl.HttpUrl.content_allows_robots().
[1] ce733ae7 ("Don't check for robots.txt directives in local html
files.", 2014-03-19)
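For illustration, the kind of check being tested can be sketched with the standard library alone. This is not the code behind content_allows_robots(); the class and function names are hypothetical, and treating "none" as implying nofollow is an assumption based on common robots meta conventions.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Record whether a <meta name="robots"> tag forbids following links."""

    def __init__(self):
        super().__init__()
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        directives = {d.strip().lower() for d in content.split(",")}
        # Assumption: "none" is treated as shorthand for "noindex, nofollow".
        if "nofollow" in directives or "none" in directives:
            self.follow = False


def allows_following(html_text):
    """Return False if the document carries a robots nofollow directive."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.follow
```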
The HTML tag has two attributes with URLs:
<applet archive="file.html" src="file.css">
It would appear that the order in which these attributes are crawled
does not match the order in the result file.
Possibly the crawling order is non-deterministic, although I cannot
reproduce that. If that's the case, the fix would be to sort the
attributes in the crawler before following them, which means we want the
expected results sorted as well (since 'archive' comes before 'src',
file.html should come before file.css).
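The proposed fix can be sketched like this, assuming the tag's attributes are available as a plain dict. The function name and the url_attributes parameter are hypothetical, and the crawler's real data structures may differ.

```python
def urls_in_crawl_order(attrs, url_attributes=("archive", "src")):
    """Return the URL-bearing attribute values, ordered by attribute name."""
    # Sorting by attribute name makes the order independent of how the
    # parser happened to store the attributes.
    return [attrs[name] for name in sorted(attrs) if name in url_attributes]


# <applet archive="file.html" src="file.css">, stored in reverse order on purpose
applet_attrs = {"src": "file.css", "archive": "file.html"}
ordered = urls_in_crawl_order(applet_attrs)  # ["file.html", "file.css"]
```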