Commit graph

121 commits

Author SHA1 Message Date
Chris Mayo
eab2fa410e Log robots.txt as the sitemap parent URL
This is the location the sitemap URL was found in. The line being
reported is the line in robots.txt.
2022-10-17 19:21:03 +01:00
Chris Mayo
7367e6e865 Skip incomplete Sitemap in robots.txt and warn
Sitemap values should be fully qualified URLs; LinkChecker may not
resolve relative paths correctly.
2022-10-17 19:21:03 +01:00
Chris Mayo
fe5a34c68f Remove linkcheck.checker.proxysupport
Set up the requests.Session() with the complete proxy configuration
to fix a problem with using an HTTP server as an HTTPS proxy and
potential redirection issues.

Requests handles no_proxy.
2021-12-13 19:25:23 +00:00
Chris Mayo
2a77e12618 Replace deprecated Thread.getName() and Condition.notifyAll() 2021-11-16 19:45:38 +00:00
Paul Haerle
f395c74aac
Make ResultCache max_size configurable (#544)
* Make ResultCache max_size configurable

fixes #463

* Add tests and docs.

* fix documentation...

...adapt the source, not the auto-generated man pages themselves as
requested in #544.

* fix typo.
2021-06-21 19:45:19 +01:00
Chris Mayo
500c13e2cb Log a debug message when a cached URL is skipped
Skipping introduced in:
eaa538c8 ("don't check one url multiple times", 2016-11-09)
2020-07-21 19:54:18 +01:00
Chris Mayo
b974ec3262 Review comments on black linkcheck/ 2020-06-01 16:07:21 +01:00
Chris Mayo
a92a684ac4 Run black on linkcheck/ 2020-05-30 17:01:36 +01:00
Marius Gedminas
4f3fe5e1c3 Make sure fetching robots.txt uses the configured timeout
Closes #396.
2020-05-22 10:53:33 +03:00
Chris Mayo
a15a2833ca Remove spaces after names in class method definitions
And also nested functions.

This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
44e81d27dd Remove inheriting object
All Python 3 classes are new-style.
2020-05-08 10:45:31 +01:00
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-unicode characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Yaroslav Halchenko
b78c2d200e DOC: minor typo fix 2018-11-01 11:08:09 -04:00
Marius Gedminas
4a092c218c Whitespace bigotry 2017-03-14 17:18:27 +02:00
Petr Dlouhý
eaa538c814 don't check one url multiple times 2017-02-14 10:23:25 +01:00
Nicolas Bigaouette
4e56eceb35 Detect if "url_data" contains proxy attributes before using them.
Fix proposed by @colwilson in issue #555.
2014-11-12 09:58:30 -05:00
Bastian Kleineidam
35eb30432e Added some Python3 fixes. 2014-09-12 19:36:30 +02:00
Bastian Kleineidam
06c6b80ed3 Fix proxy support. 2014-09-05 22:48:10 +02:00
Arlo Louis O'Keeffe
52337f82cb Use correct attribute 2014-09-03 09:36:22 +02:00
Bastian Kleineidam
b646293fd6 Remove unused import. 2014-07-15 22:38:57 +02:00
Bastian Kleineidam
90257a1b5e Replace twill with custom code. 2014-07-15 18:37:05 +02:00
Bastian Kleineidam
a665d35feb Use proxies and checker session in robots.txt. 2014-07-14 20:28:28 +02:00
Bastian Kleineidam
6c38b4165a Use given HTTP auth data for robots.txt fetching. 2014-07-14 19:50:11 +02:00
Bastian Kleineidam
22caa9367a Refactor recursion checks. 2014-04-10 17:50:55 +02:00
Bastian Kleineidam
08fbd891ef Do not check external robots.txt sitemaps. 2014-04-09 19:44:29 +02:00
Bastian Kleineidam
c57f607fc3 Use urldata.add_url() 2014-04-07 18:54:33 +02:00
Bastian Kleineidam
fc73c6ca6e Log number of checked unique URLs. 2014-03-14 23:46:17 +01:00
Bastian Kleineidam
19b8baf08c Move cached queue items to top once in a while. 2014-03-14 22:08:51 +01:00
Bastian Kleineidam
b18854649d Count unique URLs for url queue limit. 2014-03-14 20:21:46 +01:00
Bastian Kleineidam
257644e660 Add cache length function to get number of cached elements. 2014-03-14 20:19:34 +01:00
Bastian Kleineidam
6b334dc79b Fix URL result caching. 2014-03-08 19:35:10 +01:00
Bastian Kleineidam
b17211f162 Set for release. 2014-03-04 21:36:24 +01:00
Bastian Kleineidam
82f81241fd Check all links and add better caching. 2014-03-03 23:29:45 +01:00
Bastian Kleineidam
6f205a2574 Support checking Sitemap: URLs in robots.txt files. 2014-03-01 20:25:19 +01:00
Bastian Kleineidam
0f0d79c7e0 Remove crawl-delay stuff 2014-03-01 20:01:42 +01:00
Bastian Kleineidam
7b34be590b Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements. 2014-03-01 00:12:34 +01:00
Bastian Kleineidam
c806be5c15 Updated copyright 2014-01-08 22:33:04 +01:00
Bastian Kleineidam
e0a2558b2b Updated copyright. 2013-12-24 07:13:16 +01:00
Bastian Kleineidam
0ca63797bf Remove content cache. 2013-12-10 23:41:52 +01:00
Bastian Kleineidam
36badddfac Update cookie code from Python module. 2013-12-04 19:05:08 +01:00
Bastian Kleineidam
123578a4cd Make per-host connection limits configurable. 2013-02-27 19:37:28 +01:00
Bastian Kleineidam
35bc79dd90 Updated copyright. 2013-01-25 21:14:27 +01:00
Bastian Kleineidam
faa743e876 Increase per-host connection limits. 2013-01-22 18:18:48 +01:00
Bastian Kleineidam
0283362ce6 Updated copyright. 2012-12-23 21:32:16 +01:00
Bastian Kleineidam
42a17cbb98 Prepare py3 port and display sys.argv on internal errors. 2012-11-26 18:49:07 +01:00
Bastian Kleineidam
e5735e2a5d Fix URL queue handling. 2012-11-08 12:48:21 +01:00
Bastian Kleineidam
bc683577de Remove URLs from the in_progress cache. 2012-11-08 11:03:16 +01:00
Bastian Kleineidam
eabaa41bd2 Do not check duplicate URLs. 2012-11-06 21:34:22 +01:00
Bastian Kleineidam
8750d55a73 Add configuration entry for maximum number of URLs. 2012-10-14 11:13:55 +02:00
Bastian Kleineidam
3b5877161c Improved debugging. 2012-10-13 13:36:28 +02:00