3291 commits

Author SHA1 Message Date
Chris Mayo
16bee50068 Move AnchorCheck local file handling into a new class
When checking local files with AnchorCheck, anchors in URLs
like "example/#anchor" are not supported.

Without AnchorCheck enabled, the Real URL reported for such URLs
was changed to include the anchor when local file checking was added to
AnchorCheck, but it is the directory that is checked.
The same URL was also then used as the Parent URL for the check of each
of the contents of that directory.

For FileUrl this is a revert of:
c221afda ("Enable AnchorCheck to be used with local files", 2022-10-03)
2022-10-24 19:30:56 +01:00
Chris Mayo
e32c76aa5c Make text logger outro "checked" translatable 2022-10-18 19:24:08 +01:00
Chris Mayo
b6eea83f63
Merge pull request #676 from cjmayo/robotmap
Document sitemaps in linkchecker(1)
2022-10-17 19:25:57 +01:00
Chris Mayo
96c3336013
Merge pull request #677 from cjmayo/maxrate
Enable average HTTP request rate to be above 4 per second
2022-10-17 19:24:49 +01:00
Chris Mayo
689557d9af Add logging of MIME types and improve docstrings 2022-10-17 19:21:03 +01:00
Chris Mayo
eab2fa410e Log robots.txt as the sitemap parent URL
This is the location the sitemap URL was found in. The line being
reported is the line in robots.txt.
2022-10-17 19:21:03 +01:00
Chris Mayo
7367e6e865 Skip incomplete Sitemap in robots.txt and warn
Sitemap values should be fully qualified URLs; LinkChecker may not
resolve relative paths correctly.
2022-10-17 19:21:03 +01:00
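The check described in 7367e6e865 might look roughly like the sketch below. It is an illustration only, not LinkChecker's code: the names `is_fully_qualified` and `split_sitemaps` are hypothetical, and treating a value as fully qualified when it has both a scheme and a network location is an assumption.

```python
from urllib.parse import urlsplit


def is_fully_qualified(url):
    """Return True if url has both a scheme and a network location."""
    parts = urlsplit(url)
    return bool(parts.scheme and parts.netloc)


def split_sitemaps(robots_txt):
    """Split Sitemap entries into (accepted, skipped) lists.

    Each list holds (line_number, value) tuples, so a warning for a
    skipped entry can name the line in robots.txt it came from.
    """
    accepted, skipped = [], []
    for lineno, line in enumerate(robots_txt.splitlines(), start=1):
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "sitemap":
            continue  # not a Sitemap line
        value = value.strip()
        if is_fully_qualified(value):
            accepted.append((lineno, value))
        else:
            skipped.append((lineno, value))
    return accepted, skipped


robots = (
    "User-agent: *\n"
    "Sitemap: https://example.com/sitemap.xml\n"
    "Sitemap: /sitemap2.xml\n"
)
accepted, skipped = split_sitemaps(robots)
```

Here the relative value on line 3 ends up in `skipped`, which is where a caller would emit the warning.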
Chris Mayo
0c5db040c8 Support maxrequestspersecond less than one 2022-10-05 19:28:01 +01:00
Chris Mayo
e88cf49c8f Enable average HTTP request rate to be above 4 per second 2022-10-05 19:28:01 +01:00
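As a rough illustration of what a maxrequestspersecond value below one implies (see 0c5db040c8 and e88cf49c8f), the sketch below converts a rate into a minimum spacing between requests. The function name `request_interval` is hypothetical, and the log does not show how LinkChecker actually enforces the rate, so this is only one plausible reading.

```python
def request_interval(max_requests_per_second):
    """Return the minimum number of seconds between two requests.

    A rate below one request per second simply gives an interval
    longer than one second.
    """
    if max_requests_per_second <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / max_requests_per_second


slow = request_interval(0.5)  # one request every two seconds
fast = request_interval(10)   # ten requests per second
```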
Chris Mayo
f2be98b8ad Replace deprecated dns.resolver.query()
Missed in:
26c15c5e ("Fix deprecation warning for resolver.query()", 2020-09-14)
2022-10-05 19:27:13 +01:00
Chris Mayo
354ea933ca
Merge pull request #673 from cjmayo/sitemap
Fix sitemap output with multiple threads
2022-10-05 19:20:40 +01:00
Nathan Arthur
33036803b0 Fix a difference in anchor quoting between http and file
"I added a test for file:// processing, and it was showing different
results for when the URL anchor was and wasn't quoted. I tracked it down
to code in fileurl.py that was calling url_norm, and I'm pretty sure the
code is unnecessary at this point. But I made a minimally-invasive
change, to be as safe as possible."

UrlBase.build_url() in line 174 also calls url_norm()
2022-10-03 19:33:05 +01:00
Nathan Arthur
4cdaa59fcc Fix AnchorCheck mismatching encoded anchors
Problem identified by Christian Kirchhof.
2022-10-03 19:33:05 +01:00
Nathan Arthur
6499b7b233 Fix a major thread-safety bug in AnchorCheck
The threading issue has been there for years, but I didn't notice it
until after I thought I was done, while I was doing manual testing
(with threads re-enabled).

The problem was with storing URL-specific state (.anchors) on the
AnchorCheck object itself, because there's only one global AnchorCheck
object, so all the threads are competing to use that one single variable
(self.anchors).

The solution was to create a new object to hold .anchors, for each
processed URL.
2022-10-03 19:33:05 +01:00
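The fix described in 6499b7b233 can be sketched in miniature as below. This is an assumption-laden illustration, not the actual plugin: `UrlAnchors` and `AnchorPlugin` are hypothetical names, and the anchor lists are made-up inputs. The point is only that per-URL state lives in a fresh object for each check instead of on the single shared plugin instance.

```python
import threading


class UrlAnchors:
    """Holds the anchors found in one URL's content (hypothetical name)."""

    def __init__(self):
        self.anchors = []

    def add(self, name):
        self.anchors.append(name)


class AnchorPlugin:
    """Stand-in for a plugin whose single instance is shared by all threads."""

    def check(self, content_anchors, wanted):
        # State is kept in a new object per call, not on self, so
        # concurrent checks cannot overwrite each other's anchors.
        found = UrlAnchors()
        for name in content_anchors:
            found.add(name)
        return wanted in found.anchors


plugin = AnchorPlugin()  # one shared instance, as in the commit message
results = {}


def worker(key, content_anchors, wanted):
    results[key] = plugin.check(content_anchors, wanted)


threads = [
    threading.Thread(target=worker, args=("a", ["top", "intro"], "intro")),
    threading.Thread(target=worker, args=("b", ["footer"], "intro")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Had `check` stored the list as `self.anchors`, the two workers could have seen each other's anchors depending on thread scheduling.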
Nathan Arthur
c221afdab5 Enable AnchorCheck to be used with local files
[I] discovered that fileurl.py was stripping the anchors from url_data,
which breaks AnchorCheck. So I stopped it from doing that, and
tried to fix up all the places that were assuming the url would map to a
filesystem file. The tests all pass, but I'm not 100% sure I caught all
the cases, or fixed them correctly.
2022-10-03 19:33:05 +01:00
Nathan Arthur
a29750c57f Fix anchor comments in UrlBase
Parent url query not stripped since:
4a0c63aa ("Fix joining of URLs when parent URL has CGI parameter.", 2011-02-08)
2022-10-03 19:33:05 +01:00
Chris Mayo
54bcefd7d7 Revert "Disable AnchorCheck plugin"
This reverts commit 0356524369.
2022-10-03 19:33:05 +01:00
Chris Mayo
e6763f8516 Fix sitemap output with multiple threads
SitemapXmlLogger assumes the first result logged is for the root of the
website being mapped. Ensure results are logged before content is
checked.
2022-09-30 19:22:17 +01:00
Chris Mayo
52b9881820 Separate URL encoding and content encoding
Ensure users of url_data.encoding are using the URL encoding.

Combined since:
5fc01455 ("Decode content when retrieved, use bs4 to detect encoding if non-Unicode", 2019-09-30)
2022-09-29 19:21:11 +01:00
Chris Mayo
61071fc5dc
Merge pull request #668 from cjmayo/defaults
Clarify default values in initial linkcheckerrc and elsewhere
2022-09-28 19:36:44 +01:00
Lukas Pirl
8c959589c3
add option to ignore specific errors for specific URLs 2022-09-25 22:52:04 +02:00
Chris Mayo
e5168f44ea Clarify defaults and examples in initial linkcheckerrc 2022-09-22 19:24:55 +01:00
Chris Mayo
b8d0928969 Document dialect option of csv logger 2022-09-22 19:24:55 +01:00
Chris Mayo
130347f223 Remove unused WARN_IGNORE_URL
URL ignored was changed to an info message in:
7b34be59 ("Introduce check plugins, use Python requests for http/s
connections, and some code cleanups and improvements.", 2014-03-01)
2022-09-22 19:24:55 +01:00
Chris Mayo
36a45b0f96
Merge pull request #666 from cjmayo/gemini
Add gemini scheme
2022-09-22 19:23:20 +01:00
Nathan Arthur
6dc5ade29d Fixed a bug where the resultcachesize setting was ignored. 2022-09-20 19:36:23 +01:00
Chris Mayo
ed8e17137c Add gemini scheme 2022-09-16 19:21:32 +01:00
Chris Mayo
25ce4b854c Update IANA schemes 2022-09-16 19:21:32 +01:00
Chris Mayo
af265f3d52 Write all metadata used to _release.py
Enables running without installing.
Removes use of importlib.metadata.
2022-09-13 19:32:06 +01:00
Chris Mayo
30e8cfad77
Merge pull request #651 from cjmayo/rate
Rename url-rate-limited to http-rate-limited
2022-09-12 19:25:52 +01:00
Stefan Fisk
d2b9723612 Fix srcset parsing
Resolves #631
2022-09-07 21:24:23 +02:00
Chris Mayo
a0b28cc0ff Rename url-rate-limited to http-rate-limited
Make consistent with the other warnings:

- The first part of the name represents the checker class in which the
  warning is raised

- Update initial comment
2022-09-06 19:32:24 +01:00
Chris Mayo
3c7fb5b571 Fix checking directory containing Unicode filenames
Non-Unicode filenames are not supported.

sys.platform has not returned "linux2" since Python 3.3.
2022-09-05 19:28:40 +01:00
Chris Mayo
c627b00755
Merge pull request #639 from cjmayo/hatch
Replace setuptools and setup.py with hatch and pyproject.toml
2022-09-05 19:27:48 +01:00
Chris Mayo
d5058ecd7c
Merge pull request #643 from cjmayo/altname
Replace deprecated urllib3.contrib.pyopenssl.get_subj_alt_name()
2022-09-05 19:26:12 +01:00
Chris Mayo
47d1015e00 Replace setuptools and setup.py with hatch and pyproject.toml 2022-09-05 19:24:01 +01:00
Chris Mayo
f0cb2e9df9 Use cryptography.x509.not_valid_after 2022-09-05 19:20:19 +01:00
Chris Mayo
76e2712311 Replace deprecated urllib3.contrib.pyopenssl.get_subj_alt_name()
Strictly we should add a dependency on cryptography as we are now using
it directly - but for pyopenssl x509.to_cryptography() to work
cryptography would have to be already installed.
2022-09-05 19:20:19 +01:00
Chris Mayo
c79bc07cee Add MIME type application/vnd.adobe.flash.movie 2022-09-02 19:29:11 +01:00
Chris Mayo
6d9061b00a Ignore bs4 markup and XML parser warnings
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document
using an HTML parser.

MarkupResemblesLocatorWarning: The input looks more like a filename than
markup.

MarkupResemblesLocatorWarning: The input looks more like a URL than
markup.
2022-09-02 19:29:11 +01:00
Chris Mayo
d6936ceb91 Add warning url-content-type-unparseable 2022-09-02 19:29:11 +01:00
Kian-Meng Ang
a70ea9ea14 Fix typos
Found via `codespell ./linkcheck/ ./tests ./doc/man/en -L bu,noone,fo,pres,shttp`
2022-09-02 17:20:02 +08:00
Chris Mayo
b35036af2b
Merge pull request #634 from cjmayo/pyxdg
Remove dependency on pyxdg
2022-08-30 19:28:03 +01:00
Chris Mayo
d72649453c
Merge pull request #632 from cjmayo/docs
Assorted documentation updates
2022-08-30 19:27:10 +01:00
Felix Yan
7db1a867ab
Correct a typo in i18n.py 2022-08-24 19:10:41 +03:00
Chris Mayo
fbceca5dc9 Remove dependency on pyxdg
Read the environment variables and implement the same fallbacks.
Saves a hardly used dependency and is more explicit.
2022-08-23 19:26:15 +01:00
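Reading the variables directly, as fbceca5dc9 describes, could look something like the sketch below. The helper name `xdg_dir` is hypothetical and this is not LinkChecker's code; the fallbacks used here (`~/.config` and `~/.local/share`) are the defaults given by the XDG Base Directory Specification for an unset or empty variable.

```python
import os


def xdg_dir(variable, fallback_parts, environ=None):
    """Return the directory named by an XDG variable, or its fallback.

    fallback_parts are path components relative to the home directory.
    environ defaults to os.environ; it is a parameter only to make the
    function easy to exercise.
    """
    env = os.environ if environ is None else environ
    value = env.get(variable, "")
    if value:
        return value
    # Unset or empty: fall back to the default below the home directory.
    return os.path.join(os.path.expanduser("~"), *fallback_parts)


config_home = xdg_dir("XDG_CONFIG_HOME", (".config",))
data_home = xdg_dir("XDG_DATA_HOME", (".local", "share"))
```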
Chris Mayo
10f3d33041 Finish documenting the use of XDG_CONFIG_HOME and XDG_DATA_HOME
Introduced by:
a03e2e4a ("use xdg dirs for config & data", 2017-10-17)
2022-08-23 19:21:53 +01:00
Chris Mayo
94781120ac Correct mention of pdfminer in WordParser comment 2022-05-18 19:29:54 +01:00
Malte Gerth
cc48a09308 Add Telegram and WhatsApp link schemes 2022-02-06 23:41:33 +01:00
Malte Gerth
067dd8edbb Update IANA schemes 2022-02-06 23:40:36 +01:00