Commit graph

873 commits

Author SHA1 Message Date
Petr Dlouhý
e92b0a9f7b Python3: fix unicode in urlbase 2019-04-25 19:57:45 +01:00
Petr Dlouhý
b3881ce3b5 Python3: fix urlbase, strformat and others 2019-04-25 19:57:45 +01:00
anarcat
bb0a1e1992
Merge pull request #242 from cjmayo/wummel
Update references to GitHub project from wummel to linkchecker
2019-04-24 10:58:15 -04:00
anarcat
ee8667e1ca
Merge pull request #229 from cjmayo/python3_19
{python3_19} Python3: fix unicode in fileurl
2019-04-24 10:57:45 -04:00
Chris Mayo
f60810b050 Fix Python 3 "TypeError: decoding str is not supported" in FtpUrl.cwd 2019-04-22 19:34:46 +01:00
Petr Dlouhý
b40f4722c7 Python3: fix unicode in fileurl 2019-04-19 20:42:38 +01:00
EsuS
004632a99b Update references to GitHub project from wummel to linkchecker
Remove all mention of donations.
2019-04-18 19:59:52 +01:00
Petr Dlouhý
bc99dc51de Python3: fix HtmlParser 2019-04-18 19:35:16 +01:00
Petr Dlouhý
4acabf5cb5 fix urllib imports 2019-04-09 20:09:35 +01:00
gerdneuman
de6a82b378
Added whatsapp:// to ignored protocols
Fixes https://github.com/wummel/linkchecker/issues/595
2018-08-09 13:49:15 +02:00
Petr Dlouhý
256202a20b fixes for Python 3: fix proxysuport 2018-01-19 09:52:43 +01:00
Antoine Beaupré
9b12b5d66f
workaround new limitation in requests
newer requests do not expose the internal SSL socket object so we
cannot verify certificates. there was work to allow custom
verification routines which we could use, but this never finished:

https://github.com/shazow/urllib3/pull/257

so right now, just treat missing socket information as if the cert was
missing.

Closes: #76
2017-10-02 20:19:25 -04:00
Graham Seaman
233e7dcf68 Allow wayback-format urls without affecting atom 'feed' urls 2017-02-09 11:43:45 +00:00
Antoine Beaupré
9d899d1dfa add --no-robots commandline flag
While this flag can be abused, it seems to me like a legitimate use
case that you want to check a fairly small document for mistakes,
which includes references to a website which has a robots.txt that
denies all robots. It turns out that most websites do *not* add a
permission for LinkCheck to use their site, and some sites, like the
Debian BTS for example, are very hostile with bots in general.

Between me using linkcheck and me using my web browser to check those
links one by one, there is not a big difference. In fact, using
linkcheck may be *better* for the website because it will use HEAD
requests instead of a GET, and will not fetch all page elements
(javascript, images, etc) which can often be fairly big.

Besides, hostile users will patch the software themselves: it took me
only a few minutes to disable the check, and a few more to make that
into a proper patch.

By forcing robots.txt without any other option, we are hurting our
good users and not keeping hostile users from doing harm.

The patch is still incomplete, but works. It lacks: documentation and
unit tests.

Closes: #508
2016-05-19 14:43:59 -04:00
Bastian Kleineidam
92c4ca9a5e Debug request headers 2014-09-20 12:16:24 +02:00
Bastian Kleineidam
029c20ed98 More python3 fixes 2014-09-12 21:59:07 +02:00
Bastian Kleineidam
35eb30432e Added some Python3 fixes. 2014-09-12 19:36:30 +02:00
Bastian Kleineidam
06c6b80ed3 Fix proxy support. 2014-09-05 22:48:10 +02:00
Bastian Kleineidam
ee4545399d Support itms-services: URLs. #532 2014-09-05 21:06:10 +02:00
Bastian Kleineidam
c684918ba6 Ignore urllib3 warnings about invalid SSL certs since we check them ourselves. 2014-09-05 20:00:00 +02:00
Bastian Kleineidam
2354f16dbb Catch urllib3 errors. 2014-09-05 19:59:28 +02:00
Bastian Kleineidam
a665d35feb Use proxies and checker session in robots.txt. 2014-07-14 20:28:28 +02:00
Bastian Kleineidam
266e9e189f Further code cleanup. 2014-07-14 20:14:00 +02:00
Bastian Kleineidam
7838521b6e Code cleanup. 2014-07-14 19:49:01 +02:00
Bastian Kleineidam
eafa1ed2da Updated unknown URL schemes. 2014-07-13 21:51:53 +02:00
Bastian Kleineidam
0fa7ed2699 Fix empty URL handling. 2014-07-03 23:34:40 +02:00
Bastian Kleineidam
1590ab6240 cleanup 2014-07-01 21:12:47 +02:00
Bastian Kleineidam
9a124513e3 Merge branch 'master' of github.com:wummel/linkchecker 2014-07-01 21:11:33 +02:00
wummel
9bb3852edf Merge pull request #515 from Mark-Hetherington/extern-redirect
When following redirections update url.extern
2014-07-01 21:11:13 +02:00
Bastian Kleineidam
12cc12db53 Add get_redirects() function. 2014-07-01 21:11:06 +02:00
Bastian Kleineidam
cde261c009 Parse Refresh: and Content-Location: header values for URLs. 2014-07-01 20:16:43 +02:00
Bastian Kleineidam
c3ec91ac6d Fix intern URL search pattern. 2014-06-13 23:52:21 +02:00
Bastian Kleineidam
ad8eb424f3 Merge Mark-Hetherington-xml-parse-warn with slight modifications. 2014-06-13 20:50:37 +02:00
Mark Hetherington
34d83db29c When following redirections update url.extern 2014-05-19 14:59:58 +10:00
Bastian Kleineidam
eaa8a963ec Refactor logging configuration. 2014-05-10 21:23:06 +02:00
Bastian Kleineidam
0d9881cf03 Fix add_url() with local files. 2014-04-29 18:43:21 +02:00
Bastian Kleineidam
82dd76b0d7 Add PDF link parsing. 2014-04-28 18:13:45 +02:00
Bastian Kleineidam
0f8ee234c3 Fix documentation. 2014-04-28 18:10:20 +02:00
Bastian Kleineidam
6bae3e0f49 Use the same request arguments for redirects. 2014-04-23 22:03:44 +02:00
Bastian Kleineidam
6caf654031 Parse Link: heaaders. 2014-04-10 17:50:55 +02:00
Bastian Kleineidam
22caa9367a Refactor recursion checks. 2014-04-10 17:50:55 +02:00
Bastian Kleineidam
4759cee377 Updated mailto: documentation. 2014-03-30 08:30:14 +02:00
Bastian Kleineidam
81da2eb48f Code cleanup 2014-03-27 17:19:52 +01:00
Bastian Kleineidam
fa26876f67 Don't use encoding detection since it's very slow. 2014-03-27 12:27:11 +01:00
Bastian Kleineidam
49df359317 Some fixes when pyopenssl is used instead of python ssl module. 2014-03-26 19:59:17 +01:00
Bastian Kleineidam
dec0f6c8dc Fix error with SNI checks 2014-03-26 12:38:16 +01:00
Bastian Kleineidam
a8623bc0bc Display SSL info on redirects. 2014-03-26 07:16:03 +01:00
Bastian Kleineidam
be59802569 Set http connection charset. 2014-03-20 21:20:34 +01:00
Bastian Kleineidam
4c76345338 Add certificate valid date info and always set verify flag. 2014-03-19 17:16:42 +01:00
Bastian Kleineidam
9a7ad3a84f Print SSL cipher info for https URLs. 2014-03-19 17:02:34 +01:00
Bastian Kleineidam
ce733ae76b Don't check for robots.txt directives in local html files. 2014-03-19 16:33:22 +01:00
Bastian Kleineidam
9be667b52a Do not warn about missing addresses on mailto links that have subjects. 2014-03-18 23:27:59 +01:00
Bastian Kleineidam
6437f08277 Display downloaded bytes. 2014-03-14 21:06:10 +01:00
Bastian Kleineidam
c51caf1133 Assertions should be earlier. 2014-03-14 20:26:11 +01:00
Bastian Kleineidam
cfff4c4a84 Disable URL length warning for data: URLs. 2014-03-14 20:24:28 +01:00
Bastian Kleineidam
ccd0d4ead7 Updated the list of unknown or ignored URI schemes. 2014-03-12 19:20:49 +01:00
Bastian Kleineidam
bca226c293 Fix assertion checking external links; fix tests 2014-03-10 18:23:44 +01:00
Bastian Kleineidam
40b663cf9e Ignore URLs earlier. 2014-03-10 18:05:11 +01:00
Bastian Kleineidam
6b334dc79b Fix URL result caching. 2014-03-08 19:35:10 +01:00
Bastian Kleineidam
fab2c2da98 Improve content type setting. 2014-03-05 20:12:19 +01:00
Bastian Kleineidam
ef13a3fce1 Implement sitemap and sitemap index parsing. 2014-03-05 19:26:37 +01:00
Bastian Kleineidam
b72cf252fb Move parseable check down since it might get the content. 2014-03-05 19:26:05 +01:00
Bastian Kleineidam
9ef65cb774 Fix UrlData string representation. 2014-03-05 19:25:40 +01:00
Bastian Kleineidam
192cfab009 Cleanup of the UrlData.is_* functions 2014-03-05 19:23:16 +01:00
Bastian Kleineidam
978b24f2d7 Merge branch 'caching' 2014-03-04 07:21:42 +01:00
Bastian Kleineidam
f1076c8813 Increase url-too-long warning. 2014-03-03 23:31:04 +01:00
Bastian Kleineidam
82f81241fd Check all links and add better caching. 2014-03-03 23:29:45 +01:00
Bastian Kleineidam
6f205a2574 Support checking Sitemap: URLs in robots.txt files. 2014-03-01 20:25:19 +01:00
Bastian Kleineidam
0f0d79c7e0 Remove crawl-delay stuff 2014-03-01 20:01:42 +01:00
Bastian Kleineidam
7b34be590b Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements. 2014-03-01 00:12:34 +01:00
Bastian Kleineidam
c806be5c15 Updated copyright 2014-01-08 22:33:04 +01:00
Bastian Kleineidam
c076e312a2 Send an Accept header. 2014-01-08 19:56:00 +01:00
Bastian Kleineidam
e0a2558b2b Updated copyright. 2013-12-24 07:13:16 +01:00
wummel
9646f0b652 Merge pull request #418 from chuckbjones/reset-url-on-fallback
Reset to original url when falling back to GET
2013-12-17 22:37:17 -08:00
Bastian Kleineidam
103e00b4d1 Allow disabling of ssl certificate checks. 2013-12-12 22:17:57 +01:00
Bastian Kleineidam
0ca63797bf Remove content cache. 2013-12-10 23:41:52 +01:00
Bastian Kleineidam
2c5ede2eb7 Fallback to GET for Apache Coyote servers. 2013-12-08 08:22:56 +01:00
Bastian Kleineidam
023da7c993 Remove the duplicate URL content check. 2013-12-04 19:12:40 +01:00
Bastian Kleineidam
c676a4c829 Avoid DoS in SSL certificate host matching. 2013-11-30 22:07:23 +01:00
Charles Jones
4294633c04 Close connection prior to falling back to get, since we change the url back to the original at that time. 2013-08-09 13:08:51 -05:00
Charles Jones
8bc138f18b Reset to original url when falling back to GET 2013-07-30 13:38:59 -05:00
Bastian Kleineidam
c966fe6b24 Remove the http-wrong-redirect warning 2013-04-11 18:33:19 +02:00
Bastian Kleineidam
64d95e45e0 Remove local HTML and CSS syntax check. 2013-02-08 21:36:02 +01:00
Bastian Kleineidam
b104482174 Add missing docstring. 2013-01-25 21:15:12 +01:00
Bastian Kleineidam
35bc79dd90 Updated copyright. 2013-01-25 21:14:27 +01:00
Bastian Kleineidam
e6ad32c028 Catch UnicodeError for invalid host names. 2013-01-23 19:42:29 +01:00
Bastian Kleineidam
9b8cb67d78 Updated copyright. 2013-01-17 20:41:47 +01:00
Bastian Kleineidam
4dad2aa33c Support dns-prefetch URLs. 2013-01-17 20:41:09 +01:00
Bastian Kleineidam
7fe72745ae Updated copyright. 2013-01-09 23:03:12 +01:00
Bastian Kleineidam
a5b6136e70 Check word document validity before closing. 2013-01-07 21:58:02 +01:00
Bastian Kleineidam
0283362ce6 Updated copyright. 2012-12-23 21:32:16 +01:00
Bastian Kleineidam
9820530313 Use better_exchook to print more internal error info. 2012-12-18 23:06:48 +01:00
Bastian Kleineidam
42a17cbb98 Prepare py3 port and display sys.argv on internal errors. 2012-11-26 18:49:07 +01:00
Bastian Kleineidam
7ae1eadadb Improve http status 305 code message. 2012-11-13 18:13:36 +01:00
Bastian Kleineidam
cd4abb1f12 Improve repr() of url data, and remove alexa test script. 2012-11-09 19:09:38 +01:00
Bastian Kleineidam
810a62e093 Fix file url checking. 2012-11-07 19:37:16 +01:00
Bastian Kleineidam
f9a7f5ef96 Restrict local file checking. 2012-11-07 18:07:00 +01:00
Bastian Kleineidam
eabaa41bd2 Do not check duplicate URLs. 2012-11-06 21:34:22 +01:00
Bastian Kleineidam
9745be9d71 Fix cookie path matching with empty paths. 2012-10-30 17:44:00 +01:00
Bastian Kleineidam
e2fd37b886 Encode user and password for telnet connection. 2012-10-30 17:44:00 +01:00