Commit graph

938 commits

Author SHA1 Message Date
anarcat
bec68f237b
Merge pull request #299 from cjmayo/python3_42
{python3_42} fixes for Python 3: fix telneturl
2019-09-16 10:07:55 -04:00
anarcat
27d672c78b
Merge pull request #297 from cjmayo/python3_40
{python3_40} Python3: fixes form checker/__init__.py
2019-09-16 10:06:05 -04:00
Petr Dlouhý
c2af88ad2e Python3: fix for test_telnet in urlbase.py 2019-09-15 19:49:26 +01:00
Petr Dlouhý
a2e67af7b4 fixes for Python 3: fix telneturl 2019-09-15 19:49:18 +01:00
Petr Dlouhý
bb542b00e9 Python3: fixes form checker/__init__.py 2019-09-15 19:49:00 +01:00
Chris Mayo
06fdd78f91 Python3: fix TypeError in HttpUrl.read_content()
From test_http_redirect:

  File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content
    line: buf.write(data)
    locals:
      buf = <local> <_io.StringIO object at 0x7f8fe2f45e10>
      buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10>
      data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n'
TypeError: string argument expected, got 'bytes'
2019-09-15 19:42:29 +01:00
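The traceback above comes from writing bytes into a text buffer. A minimal sketch of the mismatch, assuming the content arrives as byte chunks; `read_chunks` and `reproduce_error` are hypothetical names for illustration and are not taken from the commit:

```python
import io


def read_chunks(chunks):
    """Collect byte chunks into a single bytes object."""
    buf = io.BytesIO()  # binary buffer: accepts bytes
    for data in chunks:
        buf.write(data)
    return buf.getvalue()


def reproduce_error(data):
    """Return the TypeError message raised when bytes go into a StringIO."""
    try:
        io.StringIO().write(data)  # text buffer: rejects bytes on Python 3
    except TypeError as exc:
        return str(exc)
    return None
```

Whether the actual fix switched the buffer type or decoded the data first is not stated in the commit message; the sketch only shows why the two types cannot be mixed.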
Chris Mayo
4c9ec511b5 Python3: fix opening file URLs
urllib.request.urlopen() expects a string or Request object.
2019-09-12 19:58:27 +01:00
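As an illustration of the note above, a small sketch that decodes a bytes URL to a string before handing it to `urllib.request.urlopen()`. The helper names `open_file_url` and `demo` are hypothetical, and the commit may have solved the problem differently:

```python
import tempfile
import urllib.request
from pathlib import Path


def open_file_url(url, encoding="utf-8"):
    """Open a file: URL, decoding it first if it was passed as bytes."""
    if isinstance(url, bytes):
        url = url.decode(encoding)  # urlopen() wants str or a Request object
    return urllib.request.urlopen(url)


def demo():
    """Write a temporary file and read it back through a bytes file: URL."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "index.html"
        path.write_bytes(b"<a href='x.html'>x</a>")
        url = path.as_uri().encode("ascii")  # simulate a URL held as bytes
        with open_file_url(url) as resp:
            return resp.read()
```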
anarcat
2239458966
Merge pull request #285 from cjmayo/python3_34
{python3_34} fixes for Python 3: fix test_misc
2019-09-11 09:48:14 -04:00
anarcat
492058a360
Merge pull request #281 from cjmayo/python3_30
{python3_30} Python3: fix decoding strings
2019-09-11 09:47:10 -04:00
Petr Dlouhý
f272206110 Python3: fix decoding strings 2019-09-10 19:52:23 +01:00
Petr Dlouhý
e10f25b968 fixes for Python 3: fix running problems in Python 3 2019-09-10 19:30:09 +01:00
Petr Dlouhý
129a68da38 fixes for Python 3: fix test_misc 2019-09-09 19:51:30 +01:00
Petr Dlouhý
ffb0a68ff7 Python3: fix fileurl 2019-09-05 19:41:53 +01:00
anarcat
59fe9ed876
Merge pull request #228 from cjmayo/python3_18
{python3_18} Python3: fix unicode in urlbase
2019-04-25 16:17:00 -04:00
anarcat
70f0bbf225
Merge pull request #250 from cjmayo/ftpserver
Get FtpServerTest working by updating to current pyftpdlib API
2019-04-25 16:16:33 -04:00
Petr Dlouhý
e92b0a9f7b Python3: fix unicode in urlbase 2019-04-25 19:57:45 +01:00
Petr Dlouhý
b3881ce3b5 Python3: fix urlbase, strformat and others 2019-04-25 19:57:45 +01:00
anarcat
bb0a1e1992
Merge pull request #242 from cjmayo/wummel
Update references to GitHub project from wummel to linkchecker
2019-04-24 10:58:15 -04:00
anarcat
ee8667e1ca
Merge pull request #229 from cjmayo/python3_19
{python3_19} Python3: fix unicode in fileurl
2019-04-24 10:57:45 -04:00
Chris Mayo
f60810b050 Fix Python 3 "TypeError: decoding str is not supported" in FtpUrl.cwd 2019-04-22 19:34:46 +01:00
Petr Dlouhý
b40f4722c7 Python3: fix unicode in fileurl 2019-04-19 20:42:38 +01:00
EsuS
004632a99b Update references to GitHub project from wummel to linkchecker
Remove all mention of donations.
2019-04-18 19:59:52 +01:00
Petr Dlouhý
bc99dc51de Python3: fix HtmlParser 2019-04-18 19:35:16 +01:00
Petr Dlouhý
4acabf5cb5 fix urllib imports 2019-04-09 20:09:35 +01:00
gerdneuman
de6a82b378
Added whatsapp:// to ignored protocols
Fixes https://github.com/wummel/linkchecker/issues/595
2018-08-09 13:49:15 +02:00
Petr Dlouhý
256202a20b fixes for Python 3: fix proxysupport 2018-01-19 09:52:43 +01:00
Antoine Beaupré
9b12b5d66f
workaround new limitation in requests
newer requests do not expose the internal SSL socket object so we
cannot verify certificates. there was work to allow custom
verification routines which we could use, but this never finished:

https://github.com/shazow/urllib3/pull/257

so right now, just treat missing socket information as if the cert was
missing.

Closes: #76
2017-10-02 20:19:25 -04:00
Graham Seaman
233e7dcf68 Allow wayback-format urls without affecting atom 'feed' urls 2017-02-09 11:43:45 +00:00
Antoine Beaupré
9d899d1dfa add --no-robots commandline flag
While this flag can be abused, it seems to me like a legitimate use
case that you want to check a fairly small document for mistakes,
which includes references to a website which has a robots.txt that
denies all robots. It turns out that most websites do *not* add a
permission for LinkCheck to use their site, and some sites, like the
Debian BTS for example, are very hostile with bots in general.

Between me using linkcheck and me using my web browser to check those
links one by one, there is not a big difference. In fact, using
linkcheck may be *better* for the website because it will use HEAD
requests instead of a GET, and will not fetch all page elements
(javascript, images, etc) which can often be fairly big.

Besides, hostile users will patch the software themselves: it took me
only a few minutes to disable the check, and a few more to make that
into a proper patch.

By forcing robots.txt without any other option, we are hurting our
good users and not keeping hostile users from doing harm.

The patch is still incomplete, but works. It lacks: documentation and
unit tests.

Closes: #508
2016-05-19 14:43:59 -04:00
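The behaviour described in this commit message can be sketched with the standard library's `urllib.robotparser`. This is an assumption-laden illustration, not LinkChecker's implementation: the `allowed` function, its `use_robots` parameter, and the sample `ROBOTS_TXT` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt that denies every robot, as in the case described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(url, robots_txt, user_agent="LinkChecker", use_robots=True):
    """Return True if the URL may be checked.

    With use_robots=False (the effect a flag like --no-robots would have),
    the robots.txt rules are skipped entirely.
    """
    if not use_robots:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network access
    return parser.can_fetch(user_agent, url)
```

With the deny-all file above, `allowed("http://example.com/page", ROBOTS_TXT)` is false, while passing `use_robots=False` lets the check proceed.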
Bastian Kleineidam
92c4ca9a5e Debug request headers 2014-09-20 12:16:24 +02:00
Bastian Kleineidam
029c20ed98 More python3 fixes 2014-09-12 21:59:07 +02:00
Bastian Kleineidam
35eb30432e Added some Python3 fixes. 2014-09-12 19:36:30 +02:00
Bastian Kleineidam
06c6b80ed3 Fix proxy support. 2014-09-05 22:48:10 +02:00
Bastian Kleineidam
ee4545399d Support itms-services: URLs. #532 2014-09-05 21:06:10 +02:00
Bastian Kleineidam
c684918ba6 Ignore urllib3 warnings about invalid SSL certs since we check them ourselves. 2014-09-05 20:00:00 +02:00
Bastian Kleineidam
2354f16dbb Catch urllib3 errors. 2014-09-05 19:59:28 +02:00
Bastian Kleineidam
a665d35feb Use proxies and checker session in robots.txt. 2014-07-14 20:28:28 +02:00
Bastian Kleineidam
266e9e189f Further code cleanup. 2014-07-14 20:14:00 +02:00
Bastian Kleineidam
7838521b6e Code cleanup. 2014-07-14 19:49:01 +02:00
Bastian Kleineidam
eafa1ed2da Updated unknown URL schemes. 2014-07-13 21:51:53 +02:00
Bastian Kleineidam
0fa7ed2699 Fix empty URL handling. 2014-07-03 23:34:40 +02:00
Bastian Kleineidam
1590ab6240 cleanup 2014-07-01 21:12:47 +02:00
Bastian Kleineidam
9a124513e3 Merge branch 'master' of github.com:wummel/linkchecker 2014-07-01 21:11:33 +02:00
wummel
9bb3852edf Merge pull request #515 from Mark-Hetherington/extern-redirect
When following redirections update url.extern
2014-07-01 21:11:13 +02:00
Bastian Kleineidam
12cc12db53 Add get_redirects() function. 2014-07-01 21:11:06 +02:00
Bastian Kleineidam
cde261c009 Parse Refresh: and Content-Location: header values for URLs. 2014-07-01 20:16:43 +02:00
Bastian Kleineidam
c3ec91ac6d Fix intern URL search pattern. 2014-06-13 23:52:21 +02:00
Bastian Kleineidam
ad8eb424f3 Merge Mark-Hetherington-xml-parse-warn with slight modifications. 2014-06-13 20:50:37 +02:00
Mark Hetherington
34d83db29c When following redirections update url.extern 2014-05-19 14:59:58 +10:00
Bastian Kleineidam
eaa8a963ec Refactor logging configuration. 2014-05-10 21:23:06 +02:00
Bastian Kleineidam
0d9881cf03 Fix add_url() with local files. 2014-04-29 18:43:21 +02:00
Bastian Kleineidam
82dd76b0d7 Add PDF link parsing. 2014-04-28 18:13:45 +02:00
Bastian Kleineidam
0f8ee234c3 Fix documentation. 2014-04-28 18:10:20 +02:00
Bastian Kleineidam
6bae3e0f49 Use the same request arguments for redirects. 2014-04-23 22:03:44 +02:00
Bastian Kleineidam
6caf654031 Parse Link: headers. 2014-04-10 17:50:55 +02:00
Bastian Kleineidam
22caa9367a Refactor recursion checks. 2014-04-10 17:50:55 +02:00
Bastian Kleineidam
4759cee377 Updated mailto: documentation. 2014-03-30 08:30:14 +02:00
Bastian Kleineidam
81da2eb48f Code cleanup 2014-03-27 17:19:52 +01:00
Bastian Kleineidam
fa26876f67 Don't use encoding detection since it's very slow. 2014-03-27 12:27:11 +01:00
Bastian Kleineidam
49df359317 Some fixes when pyopenssl is used instead of python ssl module. 2014-03-26 19:59:17 +01:00
Bastian Kleineidam
dec0f6c8dc Fix error with SNI checks 2014-03-26 12:38:16 +01:00
Bastian Kleineidam
a8623bc0bc Display SSL info on redirects. 2014-03-26 07:16:03 +01:00
Bastian Kleineidam
be59802569 Set http connection charset. 2014-03-20 21:20:34 +01:00
Bastian Kleineidam
4c76345338 Add certificate valid date info and always set verify flag. 2014-03-19 17:16:42 +01:00
Bastian Kleineidam
9a7ad3a84f Print SSL cipher info for https URLs. 2014-03-19 17:02:34 +01:00
Bastian Kleineidam
ce733ae76b Don't check for robots.txt directives in local html files. 2014-03-19 16:33:22 +01:00
Bastian Kleineidam
9be667b52a Do not warn about missing addresses on mailto links that have subjects. 2014-03-18 23:27:59 +01:00
Bastian Kleineidam
6437f08277 Display downloaded bytes. 2014-03-14 21:06:10 +01:00
Bastian Kleineidam
c51caf1133 Assertions should be earlier. 2014-03-14 20:26:11 +01:00
Bastian Kleineidam
cfff4c4a84 Disable URL length warning for data: URLs. 2014-03-14 20:24:28 +01:00
Bastian Kleineidam
ccd0d4ead7 Updated the list of unknown or ignored URI schemes. 2014-03-12 19:20:49 +01:00
Bastian Kleineidam
bca226c293 Fix assertion checking external links; fix tests 2014-03-10 18:23:44 +01:00
Bastian Kleineidam
40b663cf9e Ignore URLs earlier. 2014-03-10 18:05:11 +01:00
Bastian Kleineidam
6b334dc79b Fix URL result caching. 2014-03-08 19:35:10 +01:00
Bastian Kleineidam
fab2c2da98 Improve content type setting. 2014-03-05 20:12:19 +01:00
Bastian Kleineidam
ef13a3fce1 Implement sitemap and sitemap index parsing. 2014-03-05 19:26:37 +01:00
Bastian Kleineidam
b72cf252fb Move parseable check down since it might get the content. 2014-03-05 19:26:05 +01:00
Bastian Kleineidam
9ef65cb774 Fix UrlData string representation. 2014-03-05 19:25:40 +01:00
Bastian Kleineidam
192cfab009 Cleanup of the UrlData.is_* functions 2014-03-05 19:23:16 +01:00
Bastian Kleineidam
978b24f2d7 Merge branch 'caching' 2014-03-04 07:21:42 +01:00
Bastian Kleineidam
f1076c8813 Increase url-too-long warning. 2014-03-03 23:31:04 +01:00
Bastian Kleineidam
82f81241fd Check all links and add better caching. 2014-03-03 23:29:45 +01:00
Bastian Kleineidam
6f205a2574 Support checking Sitemap: URLs in robots.txt files. 2014-03-01 20:25:19 +01:00
Bastian Kleineidam
0f0d79c7e0 Remove crawl-delay stuff 2014-03-01 20:01:42 +01:00
Bastian Kleineidam
7b34be590b Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements. 2014-03-01 00:12:34 +01:00
Bastian Kleineidam
c806be5c15 Updated copyright 2014-01-08 22:33:04 +01:00
Bastian Kleineidam
c076e312a2 Send an Accept header. 2014-01-08 19:56:00 +01:00
Bastian Kleineidam
e0a2558b2b Updated copyright. 2013-12-24 07:13:16 +01:00
wummel
9646f0b652 Merge pull request #418 from chuckbjones/reset-url-on-fallback
Reset to original url when falling back to GET
2013-12-17 22:37:17 -08:00
Bastian Kleineidam
103e00b4d1 Allow disabling of ssl certificate checks. 2013-12-12 22:17:57 +01:00
Bastian Kleineidam
0ca63797bf Remove content cache. 2013-12-10 23:41:52 +01:00
Bastian Kleineidam
2c5ede2eb7 Fallback to GET for Apache Coyote servers. 2013-12-08 08:22:56 +01:00
Bastian Kleineidam
023da7c993 Remove the duplicate URL content check. 2013-12-04 19:12:40 +01:00
Bastian Kleineidam
c676a4c829 Avoid DoS in SSL certificate host matching. 2013-11-30 22:07:23 +01:00
Charles Jones
4294633c04 Close connection prior to falling back to get, since we change the url back to the original at that time. 2013-08-09 13:08:51 -05:00
Charles Jones
8bc138f18b Reset to original url when falling back to GET 2013-07-30 13:38:59 -05:00
Bastian Kleineidam
c966fe6b24 Remove the http-wrong-redirect warning 2013-04-11 18:33:19 +02:00
Bastian Kleineidam
64d95e45e0 Remove local HTML and CSS syntax check. 2013-02-08 21:36:02 +01:00
Bastian Kleineidam
b104482174 Add missing docstring. 2013-01-25 21:15:12 +01:00
Bastian Kleineidam
35bc79dd90 Updated copyright. 2013-01-25 21:14:27 +01:00
Bastian Kleineidam
e6ad32c028 Catch UnicodeError for invalid host names. 2013-01-23 19:42:29 +01:00
Bastian Kleineidam
9b8cb67d78 Updated copyright. 2013-01-17 20:41:47 +01:00
Bastian Kleineidam
4dad2aa33c Support dns-prefetch URLs. 2013-01-17 20:41:09 +01:00
Bastian Kleineidam
7fe72745ae Updated copyright. 2013-01-09 23:03:12 +01:00
Bastian Kleineidam
a5b6136e70 Check word document validity before closing. 2013-01-07 21:58:02 +01:00
Bastian Kleineidam
0283362ce6 Updated copyright. 2012-12-23 21:32:16 +01:00
Bastian Kleineidam
9820530313 Use better_exchook to print more internal error info. 2012-12-18 23:06:48 +01:00
Bastian Kleineidam
42a17cbb98 Prepare py3 port and display sys.argv on internal errors. 2012-11-26 18:49:07 +01:00
Bastian Kleineidam
7ae1eadadb Improve http status 305 code message. 2012-11-13 18:13:36 +01:00
Bastian Kleineidam
cd4abb1f12 Improve repr() of url data, and remove alexa test script. 2012-11-09 19:09:38 +01:00
Bastian Kleineidam
810a62e093 Fix file url checking. 2012-11-07 19:37:16 +01:00
Bastian Kleineidam
f9a7f5ef96 Restrict local file checking. 2012-11-07 18:07:00 +01:00
Bastian Kleineidam
eabaa41bd2 Do not check duplicate URLs. 2012-11-06 21:34:22 +01:00
Bastian Kleineidam
9745be9d71 Fix cookie path matching with empty paths. 2012-10-30 17:44:00 +01:00
Bastian Kleineidam
e2fd37b886 Encode user and password for telnet connection. 2012-10-30 17:44:00 +01:00
Bastian Kleineidam
c6d8b0050e Improve PHP command check. 2012-10-29 21:05:26 +01:00
Bastian Kleineidam
e8da486d66 Detect redirection errors when getting content. 2012-10-26 18:05:00 +02:00
Bastian Kleineidam
2390827735 Debug cookies. 2012-10-25 17:53:16 +02:00
Bastian Kleineidam
c44aa2db1f Fix anchor checking of cached HTTP URLs by using the cached content type. 2012-10-25 06:37:10 +02:00
Bastian Kleineidam
dca52145d3 Misc stuff. 2012-10-24 22:59:28 +02:00
Bastian Kleineidam
b39158e65c Improve available anchor message. 2012-10-24 22:21:46 +02:00
Bastian Kleineidam
dd2c963fac Fix non-ASCII exception handling. 2012-10-24 22:14:45 +02:00
Bastian Kleineidam
64de760b97 Added debug statements for unparseable content types. 2012-10-24 22:06:42 +02:00
Bastian Kleineidam
2ebedbaaa6 Fix content reading. 2012-10-13 16:48:29 +02:00
Bastian Kleineidam
0e4e694ad1 Fix connection handling on redirects. 2012-10-13 13:36:43 +02:00
Bastian Kleineidam
d3b44be2c4 Improved documentation. 2012-10-13 12:03:19 +02:00
Bastian Kleineidam
6a204120b6 Handle stale file system links for local file checks. 2012-10-12 17:20:19 +02:00
Bastian Kleineidam
b758fc6f52 Reuse existing response. 2012-10-10 12:27:36 +02:00
Bastian Kleineidam
e1e80b7dd5 Remove addrinfo cache. 2012-10-10 10:54:58 +02:00
Bastian Kleineidam
f484a6776d Use timeout value from configuration. 2012-10-10 10:53:52 +02:00
Bastian Kleineidam
06a25676c5 Only read the maximum data size plus one, not the whole file. 2012-10-10 06:35:33 +02:00
Bastian Kleineidam
6d47b76509 Limit HTTP and FTP connections. Gets rid of spurious BadStatusLine errors. 2012-10-09 21:04:20 +02:00
Bastian Kleineidam
ad8525c483 Improve BadStatusline error message. 2012-10-05 08:32:24 +02:00
Bastian Kleineidam
d15fafb1f7 Code cleanup. 2012-10-05 08:10:44 +02:00
Bastian Kleineidam
ed7c60e491 Do not warn about duplicate URLs which can point to the same content. 2012-10-01 13:42:46 +02:00
Bastian Kleineidam
38dd63f055 Code cleanup. 2012-09-23 16:19:42 +02:00
Bastian Kleineidam
7f8fd01b22 Add Accept-Encoding and Accept-Charset headers. 2012-09-23 15:06:44 +02:00
Bastian Kleineidam
03ecff22bb Fix endless loop in http authentication. 2012-09-22 22:21:10 +02:00
Bastian Kleineidam
653b5f27dd Updated ignored schemes. 2012-09-22 16:18:37 +02:00
Bastian Kleineidam
1c59cb4d4c Use GET in case a HEAD method does not succeed, even if robots.txt content checks denied the page. This way proper check results are achieved (but the content is still not checked, so it's ok). 2012-09-22 07:53:11 +02:00
Bastian Kleineidam
bbf25106fa Fix double result setting on http checks. 2012-09-21 20:33:15 +02:00
Bastian Kleineidam
c274b50c50 Store lowercase URL scheme in checker class. 2012-09-21 14:35:25 +02:00
Bastian Kleineidam
0941f6ff02 Improve exception handling by using unicode. 2012-09-21 14:29:20 +02:00
Bastian Kleineidam
049882e4fe Remove accept-encoding since some sites have wrong compression. 2012-09-20 22:39:15 +02:00
Bastian Kleineidam
7c6dce6136 Only warn non-empty site duplicates. 2012-09-20 20:39:36 +02:00
Bastian Kleineidam
a03090c20f Optimize intern/extern pattern parsing. 2012-09-20 20:19:13 +02:00
Bastian Kleineidam
b9d234c78a Fix wrong method name in SSL certificate check. 2012-09-20 16:28:01 +02:00
Bastian Kleineidam
bff217c58b Never log ignored warnings. 2012-09-20 12:44:40 +02:00
Bastian Kleineidam
600b7c0e69 Fix duplicate content warning when self.size is not set yet. 2012-09-20 12:44:23 +02:00
Bastian Kleineidam
18a200d85f Fix tests. 2012-09-19 11:05:26 +02:00