When checking local files with AnchorCheck, anchors in URLs
like "example/#anchor" are not supported.
Without AnchorCheck enabled, the Real URL reported for such URLs
was changed to include the anchor when local file checking was added to
AnchorCheck, but it is the directory that is checked.
The same URL was also then used as the Parent URL for the check of each
of the contents of that directory.
For FileUrl this is a revert of:
c221afda ("Enable AnchorCheck to be used with local files", 2022-10-03)
"I added a test for file:// processing, and it was showing different
results for when the URL anchor was and wasn't quoted. I tracked it down
to code in fileurl.py that was calling url_norm, and I'm pretty sure the
code is unnecessary at this point. But I made a minimally-invasive
change, to be as safe as possible."
UrlBase.build_url() in line 174 also calls url_norm()
The threading issue has been there for years, but I didn't notice it
until after I thought I was done, while I was doing manual testing
(with threads re-enabled).
The problem was with storing URL-specific state (.anchors) on the
AnchorCheck object itself, because there's only one global AnchorCheck
object, so all the threads are competing to use that one simgle variable
(self.anchors).
The solution was to create a new object to hold .anchors, for each
processed URL.
[I] discovered that fileurl.py was stripping the anchors from url_data,
which breaks AnchorCheck. So I stopped it from doing that, and
tried to fix up all the places that were assuming the url would map to a
filesystem file. The tests all pass, but I'm not 100% sure I caught all
the cases, or fixed them correctly.
Ensure users of url_data.encoding are using the URL encoding.
Combined since:
5fc01455 ("Decode content when retrieved, use bs4 to detect encoding if non-Unicode", 2019-09-30)
URL ignored was changed to an info message in:
7b34be59 ("Introduce check plugins, use Python requests for http/s
connections, and some code cleanups and improvements.", 2014-03-01)
Make consistent with the other warnings:
- The first part of the name represents the checker class in which the
warning is raised
- Update initial comment
Strictly we should add a dependency on cryptography as we are now using
it directly - but for pyopenssl x509.to_cryptography() to work
cryptography would have to be already installed.
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document
using an HTML parser.
MarkupResemblesLocatorWarning: The input looks more like a filename than
markup.
MarkupResemblesLocatorWarning: The input looks more like a URL than
markup.