linkchecker

mirror of https://github.com/Hopiu/linkchecker.git synced 2026-04-29 10:34:42 +00:00

Author	SHA1	Message	Date
Chris Mayo	736c893707	Merge pull request #377 from cjmayo/tidyten3 Remove u string prefixes	2020-05-13 19:36:54 +01:00
Chris Mayo	b0ea72e8c1	Remove # -*- coding: lines Except for tests that include non-unicode characters: tests/test_po.py tests/test_strformat.py tests/test_url.py tests/checker/test_error.py tests/checker/test_news.py	2020-05-08 10:45:31 +01:00
Chris Mayo	4d3e5abcfa	Remove u string prefixes	2020-04-30 20:11:59 +01:00
Chris Mayo	4ffdbf2406	Replace MetaRobotsFinder using BeautifulSoup.find()	2020-04-29 20:07:00 +01:00
Chris Mayo	ee6628a831	Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py Remove one subpackage and some import lines where htmlutil.linkparse is also being used.	2020-04-18 20:30:45 +01:00
Chris Mayo	0795e3c1b4	Replace Parser class using BeautifulSoup.find_all()	2020-04-10 13:51:09 +01:00
Chris Mayo	02e1c389b2	Remove parser flush() and reset() Remnants of the feed() interface.	2020-04-08 20:03:35 +01:00
Chris Mayo	40f43ae41c	Create one function to make soup objects	2020-04-08 20:03:35 +01:00
Chris Mayo	9d8d251d06	Replace Parser lineno() and column() methods Stop storing this data in Parser object state.	2020-04-08 20:03:35 +01:00
Chris Mayo	3ff3d72492	Use BeautifulSoup element attrs directly	2020-04-03 19:24:08 +01:00
Chris Mayo	5b66964afa	Remove unused .charset from checker classes Unused since: `4f8c2954` ("Don't set parser.encoding", 2019-10-05)	2020-03-30 19:32:30 +01:00
Chris Mayo	5eaad24641	Use HTTP header encoding for decoding	2020-03-22 19:54:37 +00:00
Chris Mayo	153e53ba03	Reuse soup object used for detecting encoding in the HTML parser	2019-10-05 19:38:57 +01:00
Chris Mayo	4f8c2954cf	Don't set parser.encoding Read-only property with new Beautiful Soup parser.	2019-10-05 19:38:57 +01:00
Marius Gedminas	58b0d5aaae	Fix TypeError: string arg required in content_allows_robots() See #323 and #317.	2019-10-22 14:13:45 +03:00
Chris Mayo	06fdd78f91	Python3: fix TypeError in HttpUrl.read_content() From test_http_redirect: File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content line: buf.write(data) locals: buf = <local> <_io.StringIO object at 0x7f8fe2f45e10> buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10> data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n' TypeError: string argument expected, got 'bytes'	2019-09-15 19:42:29 +01:00
EsuS	004632a99b	Update references to GitHub project from wummel to linkchecker Remove all mention of donations.	2019-04-18 19:59:52 +01:00
Antoine Beaupré	9b12b5d66f	workaround new limitation in requests: newer requests do not expose the internal SSL socket object, so we cannot verify certificates. There was work to allow custom verification routines which we could use, but this never finished: https://github.com/shazow/urllib3/pull/257 So right now, just treat missing socket information as if the cert was missing. Closes: #76	2017-10-02 20:19:25 -04:00
Antoine Beaupré	9d899d1dfa	add --no-robots commandline flag While this flag can be abused, it seems to me like a legitimate use case that you want to check a fairly small document for mistakes, which includes references to a website which has a robots.txt that denies all robots. It turns out that most websites do not add a permission for LinkCheck to use their site, and some sites, like the Debian BTS for example, are very hostile with bots in general. Between me using linkcheck and me using my web browser to check those links one by one, there is not a big difference. In fact, using linkcheck may be better for the website because it will use HEAD requests instead of a GET, and will not fetch all page elements (javascript, images, etc) which can often be fairly big. Besides, hostile users will patch the software themselves: it took me only a few minutes to disable the check, and a few more to make that into a proper patch. By forcing robots.txt without any other option, we are hurting our good users and not keeping hostile users from doing harm. The patch is still incomplete, but works. It lacks: documentation and unit tests. Closes: #508	2016-05-19 14:43:59 -04:00
Bastian Kleineidam	92c4ca9a5e	Debug request headers	2014-09-20 12:16:24 +02:00
Bastian Kleineidam	029c20ed98	More python3 fixes	2014-09-12 21:59:07 +02:00
Bastian Kleineidam	c684918ba6	Ignore urllib3 warnings about invalid SSL certs since we check them ourselves.	2014-09-05 20:00:00 +02:00
Bastian Kleineidam	a665d35feb	Use proxies and checker session in robots.txt.	2014-07-14 20:28:28 +02:00
Bastian Kleineidam	1590ab6240	cleanup	2014-07-01 21:12:47 +02:00
Bastian Kleineidam	9a124513e3	Merge branch 'master' of github.com:wummel/linkchecker	2014-07-01 21:11:33 +02:00
wummel	9bb3852edf	Merge pull request #515 from Mark-Hetherington/extern-redirect When following redirections update url.extern	2014-07-01 21:11:13 +02:00
Bastian Kleineidam	12cc12db53	Add get_redirects() function.	2014-07-01 21:11:06 +02:00
Bastian Kleineidam	cde261c009	Parse Refresh: and Content-Location: header values for URLs.	2014-07-01 20:16:43 +02:00
Mark Hetherington	34d83db29c	When following redirections update url.extern	2014-05-19 14:59:58 +10:00
Bastian Kleineidam	eaa8a963ec	Refactor logging configuration.	2014-05-10 21:23:06 +02:00
Bastian Kleineidam	6bae3e0f49	Use the same request arguments for redirects.	2014-04-23 22:03:44 +02:00
Bastian Kleineidam	6caf654031	Parse Link: headers.	2014-04-10 17:50:55 +02:00
Bastian Kleineidam	fa26876f67	Don't use encoding detection since it's very slow.	2014-03-27 12:27:11 +01:00
Bastian Kleineidam	49df359317	Some fixes when pyopenssl is used instead of python ssl module.	2014-03-26 19:59:17 +01:00
Bastian Kleineidam	dec0f6c8dc	Fix error with SNI checks	2014-03-26 12:38:16 +01:00
Bastian Kleineidam	a8623bc0bc	Display SSL info on redirects.	2014-03-26 07:16:03 +01:00
Bastian Kleineidam	be59802569	Set http connection charset.	2014-03-20 21:20:34 +01:00
Bastian Kleineidam	4c76345338	Add certificate valid date info and always set verify flag.	2014-03-19 17:16:42 +01:00
Bastian Kleineidam	9a7ad3a84f	Print SSL cipher info for https URLs.	2014-03-19 17:02:34 +01:00
Bastian Kleineidam	ce733ae76b	Don't check for robots.txt directives in local html files.	2014-03-19 16:33:22 +01:00
Bastian Kleineidam	6b334dc79b	Fix URL result caching.	2014-03-08 19:35:10 +01:00
Bastian Kleineidam	fab2c2da98	Improve content type setting.	2014-03-05 20:12:19 +01:00
Bastian Kleineidam	ef13a3fce1	Implement sitemap and sitemap index parsing.	2014-03-05 19:26:37 +01:00
Bastian Kleineidam	192cfab009	Cleanup of the UrlData.is_* functions	2014-03-05 19:23:16 +01:00
Bastian Kleineidam	6f205a2574	Support checking Sitemap: URLs in robots.txt files.	2014-03-01 20:25:19 +01:00
Bastian Kleineidam	0f0d79c7e0	Remove crawl-delay stuff	2014-03-01 20:01:42 +01:00
Bastian Kleineidam	7b34be590b	Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements.	2014-03-01 00:12:34 +01:00
Bastian Kleineidam	c806be5c15	Updated copyright	2014-01-08 22:33:04 +01:00
Bastian Kleineidam	c076e312a2	Send an Accept header.	2014-01-08 19:56:00 +01:00
Bastian Kleineidam	e0a2558b2b	Updated copyright.	2013-12-24 07:13:16 +01:00

300 commits