300 commits

Author SHA1 Message Date
Chris Mayo
736c893707 Merge pull request #377 from cjmayo/tidyten3
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-unicode characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa Remove u string prefixes 2020-04-30 20:11:59 +01:00
Chris Mayo
4ffdbf2406 Replace MetaRobotsFinder using BeautifulSoup.find() 2020-04-29 20:07:00 +01:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
0795e3c1b4 Replace Parser class using BeautifulSoup.find_all() 2020-04-10 13:51:09 +01:00
Chris Mayo
02e1c389b2 Remove parser flush() and reset()
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
40f43ae41c Create one function to make soup objects 2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06 Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
3ff3d72492 Use BeautifulSoup element attrs directly 2020-04-03 19:24:08 +01:00
Chris Mayo
5b66964afa Remove unused .charset from checker classes
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Chris Mayo
5eaad24641 Use HTTP header encoding for decoding 2020-03-22 19:54:37 +00:00
Chris Mayo
153e53ba03 Reuse soup object used for detecting encoding in the HTML parser 2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf Don't set parser.encoding
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Marius Gedminas
58b0d5aaae Fix TypeError: string arg required in content_allows_robots()
See #323 and #317.
2019-10-22 14:13:45 +03:00
Chris Mayo
06fdd78f91 Python3: fix TypeError in HttpUrl.read_content()
From test_http_redirect:

  File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content
    line: buf.write(data)
    locals:
      buf = <local> <_io.StringIO object at 0x7f8fe2f45e10>
      buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10>
      data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n'
TypeError: string argument expected, got 'bytes'
2019-09-15 19:42:29 +01:00
EsuS
004632a99b Update references to GitHub project from wummel to linkchecker
Remove all mention of donations.
2019-04-18 19:59:52 +01:00
Antoine Beaupré
9b12b5d66f workaround new limitation in requests
newer requests do not expose the internal SSL socket object so we
cannot verify certificates. there was work to allow custom
verification routines which we could use, but this never finished:

https://github.com/shazow/urllib3/pull/257

so right now, just treat missing socket information as if the cert was
missing.

Closes: #76
2017-10-02 20:19:25 -04:00
Antoine Beaupré
9d899d1dfa add --no-robots commandline flag
While this flag can be abused, it seems to me like a legitimate use
case that you want to check a fairly small document for mistakes,
which includes references to a website which has a robots.txt that
denies all robots. It turns out that most websites do *not* add a
permission for LinkCheck to use their site, and some sites, like the
Debian BTS for example, are very hostile with bots in general.

Between me using linkcheck and me using my web browser to check those
links one by one, there is not a big difference. In fact, using
linkcheck may be *better* for the website because it will use HEAD
requests instead of a GET, and will not fetch all page elements
(javascript, images, etc) which can often be fairly big.

Besides, hostile users will patch the software themselves: it took me
only a few minutes to disable the check, and a few more to make that
into a proper patch.

By forcing robots.txt without any other option, we are hurting our
good users and not keeping hostile users from doing harm.

The patch is still incomplete, but works. It lacks: documentation and
unit tests.

Closes: #508
2016-05-19 14:43:59 -04:00
Bastian Kleineidam
92c4ca9a5e Debug request headers 2014-09-20 12:16:24 +02:00
Bastian Kleineidam
029c20ed98 More python3 fixes 2014-09-12 21:59:07 +02:00
Bastian Kleineidam
c684918ba6 Ignore urllib3 warnings about invalid SSL certs since we check them ourselves. 2014-09-05 20:00:00 +02:00
Bastian Kleineidam
a665d35feb Use proxies and checker session in robots.txt. 2014-07-14 20:28:28 +02:00
Bastian Kleineidam
1590ab6240 cleanup 2014-07-01 21:12:47 +02:00
Bastian Kleineidam
9a124513e3 Merge branch 'master' of github.com:wummel/linkchecker 2014-07-01 21:11:33 +02:00
wummel
9bb3852edf Merge pull request #515 from Mark-Hetherington/extern-redirect
When following redirections update url.extern
2014-07-01 21:11:13 +02:00
Bastian Kleineidam
12cc12db53 Add get_redirects() function. 2014-07-01 21:11:06 +02:00
Bastian Kleineidam
cde261c009 Parse Refresh: and Content-Location: header values for URLs. 2014-07-01 20:16:43 +02:00
Mark Hetherington
34d83db29c When following redirections update url.extern 2014-05-19 14:59:58 +10:00
Bastian Kleineidam
eaa8a963ec Refactor logging configuration. 2014-05-10 21:23:06 +02:00
Bastian Kleineidam
6bae3e0f49 Use the same request arguments for redirects. 2014-04-23 22:03:44 +02:00
Bastian Kleineidam
6caf654031 Parse Link: headers. 2014-04-10 17:50:55 +02:00
Bastian Kleineidam
fa26876f67 Don't use encoding detection since it's very slow. 2014-03-27 12:27:11 +01:00
Bastian Kleineidam
49df359317 Some fixes when pyopenssl is used instead of python ssl module. 2014-03-26 19:59:17 +01:00
Bastian Kleineidam
dec0f6c8dc Fix error with SNI checks 2014-03-26 12:38:16 +01:00
Bastian Kleineidam
a8623bc0bc Display SSL info on redirects. 2014-03-26 07:16:03 +01:00
Bastian Kleineidam
be59802569 Set http connection charset. 2014-03-20 21:20:34 +01:00
Bastian Kleineidam
4c76345338 Add certificate valid date info and always set verify flag. 2014-03-19 17:16:42 +01:00
Bastian Kleineidam
9a7ad3a84f Print SSL cipher info for https URLs. 2014-03-19 17:02:34 +01:00
Bastian Kleineidam
ce733ae76b Don't check for robots.txt directives in local html files. 2014-03-19 16:33:22 +01:00
Bastian Kleineidam
6b334dc79b Fix URL result caching. 2014-03-08 19:35:10 +01:00
Bastian Kleineidam
fab2c2da98 Improve content type setting. 2014-03-05 20:12:19 +01:00
Bastian Kleineidam
ef13a3fce1 Implement sitemap and sitemap index parsing. 2014-03-05 19:26:37 +01:00
Bastian Kleineidam
192cfab009 Cleanup of the UrlData.is_* functions 2014-03-05 19:23:16 +01:00
Bastian Kleineidam
6f205a2574 Support checking Sitemap: URLs in robots.txt files. 2014-03-01 20:25:19 +01:00
Bastian Kleineidam
0f0d79c7e0 Remove crawl-delay stuff 2014-03-01 20:01:42 +01:00
Bastian Kleineidam
7b34be590b Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements. 2014-03-01 00:12:34 +01:00
Bastian Kleineidam
c806be5c15 Updated copyright 2014-01-08 22:33:04 +01:00
Bastian Kleineidam
c076e312a2 Send an Accept header. 2014-01-08 19:56:00 +01:00
Bastian Kleineidam
e0a2558b2b Updated copyright. 2013-12-24 07:13:16 +01:00