Commit graph

328 commits

Author SHA1 Message Date
Chris Mayo
52b9881820 Separate URL encoding and content encoding
Ensure users of url_data.encoding are using the URL encoding.

They had been combined since:
5fc01455 ("Decode content when retrieved, use bs4 to detect encoding if non-Unicode", 2019-09-30)
2022-09-29 19:21:11 +01:00
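A minimal sketch of the distinction this commit draws; everything except the name url_data.encoding is hypothetical. The URL encoding is the charset used to percent-encode the URL string itself, while the content encoding describes the bytes of the fetched response body.

import urllib.parse

import requests

# URL encoding: charset used when percent-encoding the URL string.
url_encoding = "utf-8"   # roughly what url_data.encoding should refer to
quoted = urllib.parse.quote("https://example.com/café", safe=":/", encoding=url_encoding)

# Content encoding: charset of the response body, a separate concern.
response = requests.get(quoted)
content_encoding = response.encoding or "utf-8"
text = response.content.decode(content_encoding, errors="replace")
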
Chris Mayo
a0b28cc0ff Rename url-rate-limited to http-rate-limited
Make consistent with the other warnings:

- The first part of the name represents the checker class in which the
  warning is raised

- Update initial comment
2022-09-06 19:32:24 +01:00
Chris Mayo
d6936ceb91 Add warning url-content-type-unparseable 2022-09-02 19:29:11 +01:00
Chris Mayo
4444a87eb9 Update Requests bug link 2021-12-15 19:34:24 +00:00
Chris Mayo
fe5a34c68f Remove linkcheck.checker.proxysupport
Set up the requests.Session() with the complete proxy configuration
to fix a problem with using an HTTP server as an HTTPS proxy and
potential redirection issues.

Requests handles no_proxy.
2021-12-13 19:25:23 +00:00
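A minimal sketch of the approach described above, with a hypothetical proxy URL: give the requests.Session a complete proxy mapping once, rather than re-deriving it per request, so an HTTP server can also act as the HTTPS proxy and redirected requests keep using it.

import requests

session = requests.Session()
session.proxies.update({
    "http": "http://proxy.example.com:3128",   # hypothetical proxy
    "https": "http://proxy.example.com:3128",  # same HTTP server proxying HTTPS
})
# no_proxy / NO_PROXY is honoured by Requests itself when it reads the environment.
response = session.get("https://example.com/")
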
Chris Mayo
a04214465a Update HttpUrl.encoding after following redirects 2021-12-06 19:34:31 +00:00
Chris Mayo
0325ecd73f Remove httpurl.HEADER_ENCODING
Unused since:
d91a32822 ("Remove strformat.unicode_safe() and strformat.url_unicode_split()", 2020-07-07)
2021-12-06 19:34:31 +00:00
Chris Mayo
c89c617a58 Ignore an encoding of ISO-8859-1 returned by Requests
ISO-8859-1 is a fallback for Requests and causes us to mangle UTF-8
content.

Requests' utils.py:

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = _parse_content_type_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

    if 'application/json' in content_type:
        # Assume UTF-8 based on RFC 4627: https://www.ietf.org/rfc/rfc4627.txt since the charset was unset
        return 'utf-8'
2021-11-29 19:52:37 +00:00
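A minimal sketch (not LinkChecker's actual code) of how a caller can ignore that fallback: only trust response.encoding when the Content-Type header really carried a charset parameter, and otherwise leave the encoding undetermined so it can be detected from the content.

import requests

response = requests.get("https://example.com/page.html")
encoding = response.encoding
content_type = response.headers.get("Content-Type", "")
if encoding and encoding.lower() == "iso-8859-1" and "charset" not in content_type.lower():
    # ISO-8859-1 here is just Requests' fallback for text/* responses without a
    # charset; drop it and let content-based detection (e.g. BeautifulSoup) decide.
    encoding = None
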
Chris Mayo
8779c39735 Replace deprecated urllib.parse.split functions 2020-08-22 16:28:53 +01:00
Chris Mayo
1b497389b5 Merge pull request #483 from cjmayo/retryafter
Don't translate "Retry-After" server header field
2020-08-21 16:51:17 +01:00
Chris Mayo
7ee151ebbf Don't translate "Retry-After" server header field
It is defined in RFC 7231.
2020-08-14 19:29:19 +01:00
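"Translate" here presumably refers to gettext message translation. A minimal, hypothetical sketch of the point: the protocol-defined header name must appear literally, and only the surrounding message text is marked for translation.

from gettext import gettext as _

headers = {"Retry-After": "120"}           # hypothetical response headers
retry_after = headers.get("Retry-After")   # literal RFC 7231 field name, never _("Retry-After")
message = _("Rate limited, retry after %(delay)s seconds") % {"delay": retry_after}
print(message)
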
Chris Mayo
dee21ee9a0 Fix formatting and typos in docstrings 2020-07-25 16:35:48 +01:00
Chris Mayo
53bd5c4d21 Remove HttpUrl.getheader() 2020-07-07 17:25:28 +01:00
Chris Mayo
3fcee872b6 urlparts need to support assignment 2020-07-07 17:25:28 +01:00
Chris Mayo
d91a328224 Remove strformat.unicode_safe() and strformat.url_unicode_split()
All strings support Unicode in Python 3.
2020-07-07 17:25:28 +01:00
Chris Mayo
a6b1eb45b1 Convert to Python 3 super() 2020-06-03 20:06:36 +01:00
Chris Mayo
b9f4864d9e Remove unnecessary commas before closing brackets in linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
a92a684ac4 Run black on linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
97f50e8be1 Remove unused import htmlsoup from checker/httpurl.py
Unused since:

f7337f55 ("Fix error due to an empty html file accessed over http", 2020-05-23)
2020-05-25 19:50:57 +01:00
Marius Gedminas
d0169c46d4 Merge pull request #348 from weshaggard/HandleRateLimiting
Turn status code 429 into warning instead of failure
2020-05-24 16:16:56 +03:00
Marius Gedminas
dcafa2df75 Avoid u-prefixed strings
linkchecker is Python 3 only; all strings are Unicode.
2020-05-24 14:50:07 +03:00
Chris Mayo
03b1c4919d Record encoding in debug log messages 2020-05-23 20:01:24 +01:00
Chris Mayo
f7337f55e8 Fix error due to an empty html file accessed over http
Use the already fixed [1] UrlBase.get_content() in HttpUrl.

[1] 5bd1fb4 ("Fix internal error on empty HTML files", 2020-05-21)
2020-05-23 20:01:24 +01:00
Marius Gedminas
f268a90cfb Merge branch 'master' into HandleRateLimiting 2020-05-23 14:15:52 +03:00
Marius Gedminas
4f3fe5e1c3 Make sure fetching robots.txt uses the configured timeout
Closes #396.
2020-05-22 10:53:33 +03:00
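A minimal, hypothetical sketch of the behaviour: pass the configured timeout through to the robots.txt request so it cannot hang indefinitely.

import requests

def fetch_robots_txt(session, scheme_and_host, timeout):
    # Without a timeout, Requests will wait forever for a stalled server.
    return session.get(scheme_and_host + "/robots.txt", timeout=timeout)

session = requests.Session()
response = fetch_robots_txt(session, "https://example.com", timeout=60)
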
Chris Mayo
a15a2833ca Remove spaces after names in class method definitions
And also nested functions.

This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
fc11d08968 Remove spaces after names in class definitions 2020-05-16 20:19:42 +01:00
Chris Mayo
736c893707 Merge pull request #377 from cjmayo/tidyten3
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-ASCII characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa Remove u string prefixes 2020-04-30 20:11:59 +01:00
Chris Mayo
4ffdbf2406 Replace MetaRobotsFinder using BeautifulSoup.find() 2020-04-29 20:07:00 +01:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
0795e3c1b4 Replace Parser class using BeautifulSoup.find_all() 2020-04-10 13:51:09 +01:00
Chris Mayo
02e1c389b2 Remove parser flush() and reset()
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
40f43ae41c Create one function to make soup objects 2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06 Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
3ff3d72492 Use BeautifulSoup element attrs directly 2020-04-03 19:24:08 +01:00
Chris Mayo
5b66964afa Remove unused .charset from checker classes
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Wes Haggard
dcdc64e878 Turn status code 429 into warning instead of failure 2020-03-25 16:36:08 -07:00
Chris Mayo
5eaad24641 Use HTTP header encoding for decoding 2020-03-22 19:54:37 +00:00
Marius Gedminas
58b0d5aaae Fix TypeError: string arg required in content_allows_robots()
See #323 and #317.
2019-10-22 14:13:45 +03:00
Chris Mayo
153e53ba03 Reuse soup object used for detecting encoding in the HTML parser 2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf Don't set parser.encoding
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Chris Mayo
06fdd78f91 Python3: fix TypeError in HttpUrl.read_content()
From test_http_redirect:

  File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content
    line: buf.write(data)
    locals:
      buf = <local> <_io.StringIO object at 0x7f8fe2f45e10>
      buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10>
      data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n'
TypeError: string argument expected, got 'bytes'
2019-09-15 19:42:29 +01:00
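A minimal sketch of the kind of fix the traceback points at (not the actual patch): collect the response body in a bytes buffer, since Requests yields bytes and io.StringIO only accepts str.

import io

import requests

response = requests.get("http://example.com/redirect.html", stream=True)
buf = io.BytesIO()                                   # was io.StringIO(), which raised TypeError on bytes
for data in response.iter_content(chunk_size=8192):
    buf.write(data)                                  # data is bytes, BytesIO accepts it
content = buf.getvalue()
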
EsuS
004632a99b Update references to GitHub project from wummel to linkchecker
Remove all mention of donations.
2019-04-18 19:59:52 +01:00
Antoine Beaupré
9b12b5d66f Work around new limitation in requests
Newer requests does not expose the internal SSL socket object, so we
cannot verify certificates. There was work to allow custom
verification routines which we could use, but it was never finished:

https://github.com/shazow/urllib3/pull/257

So for now, just treat missing socket information as if the cert was
missing.

Closes: #76
2017-10-02 20:19:25 -04:00
Antoine Beaupré
9d899d1dfa add --no-robots commandline flag
While this flag can be abused, it seems to me like a legitimate use
case that you want to check a fairly small document for mistakes,
which includes references to a website whose robots.txt denies all
robots. It turns out that most websites do *not* add a permission for
LinkChecker to use their site, and some sites, like the Debian BTS for
example, are very hostile to bots in general.

Between me using linkchecker and me using my web browser to check those
links one by one, there is not a big difference. In fact, using
linkchecker may be *better* for the website because it will use HEAD
requests instead of GET, and will not fetch all page elements
(JavaScript, images, etc.), which can often be fairly big.

Besides, hostile users will patch the software themselves: it took me
only a few minutes to disable the check, and a few more to make that
into a proper patch.

By forcing robots.txt without any other option, we are hurting our
good users and not keeping hostile users from doing harm.

The patch is still incomplete, but works. It lacks documentation and
unit tests.

Closes: #508
2016-05-19 14:43:59 -04:00
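A minimal, hypothetical sketch of wiring a --no-robots flag into argparse; the real option handling in linkchecker differs.

import argparse

parser = argparse.ArgumentParser(prog="linkchecker")
parser.add_argument(
    "--no-robots",
    action="store_true",
    help="do not obey robots.txt exclusions",
)
args = parser.parse_args(["--no-robots"])
print(args.no_robots)   # True
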
Bastian Kleineidam
92c4ca9a5e Debug request headers 2014-09-20 12:16:24 +02:00
Bastian Kleineidam
029c20ed98 More python3 fixes 2014-09-12 21:59:07 +02:00
Bastian Kleineidam
c684918ba6 Ignore urllib3 warnings about invalid SSL certs since we check them ourselves. 2014-09-05 20:00:00 +02:00