linkchecker

mirror of https://github.com/Hopiu/linkchecker.git synced 2026-05-20 12:21:53 +00:00

Author	SHA1	Message	Date
Chris Mayo	b9f4864d9e	Remove unnecessary commas before closing brackets in linkcheck/	2020-05-30 17:01:36 +01:00
Chris Mayo	a92a684ac4	Run black on linkcheck/	2020-05-30 17:01:36 +01:00
Chris Mayo	488e72c81f	Ignore imports providing aliases in subpackages	2020-05-26 19:49:59 +01:00
Chris Mayo	97f50e8be1	Remove unused import htmlsoup from checker/httpurl.py Unused since: `f7337f55` ("Fix error due to an empty html file accessed over http", 2020-05-23)	2020-05-25 19:50:57 +01:00
Marius Gedminas	d0169c46d4	Merge pull request #348 from weshaggard/HandleRateLimiting Turn status code 429 into warning instead of failure	2020-05-24 16:16:56 +03:00
Marius Gedminas	dcafa2df75	Avoid u-prefixed strings linkchecker is Python 3 only, all strings are unicode.	2020-05-24 14:50:07 +03:00
Chris Mayo	03b1c4919d	Record encoding in debug log messages	2020-05-23 20:01:24 +01:00
Chris Mayo	f7337f55e8	Fix error due to an empty html file accessed over http Use the already fixed [1] UrlBase.get_content() in HttpUrl. [1] `5bd1fb4` ("Fix internal error on empty HTML files", 2020-05-21)	2020-05-23 20:01:24 +01:00
Marius Gedminas	f268a90cfb	Merge branch 'master' into HandleRateLimiting	2020-05-23 14:15:52 +03:00
Marius Gedminas	4f3fe5e1c3	Make sure fetching robots.txt uses the configured timeout Closes #396.	2020-05-22 10:53:33 +03:00
Marius Gedminas	c60d7c66e4	Clarify the decision to fall back to Latin-1	2020-05-21 19:35:39 +03:00
Marius Gedminas	5bd1fb4e36	Fix internal error on empty HTML files When BeautifulSoup finds an empty file on disk, it sets original_encoding to None. It doesn't matter what encoding we pick for empty files, so let's just pick one. I don't know if there are any circumstances where BeautifulSoup might set the encoding to None for a non-empty file. Closes #392.	2020-05-21 19:01:33 +03:00
Chris Mayo	6bddd4ac60	Remove str_text from checker/	2020-05-19 19:56:42 +01:00
Chris Mayo	a127902607	Replace str_text in asserts	2020-05-19 19:56:42 +01:00
Chris Mayo	a15a2833ca	Remove spaces after names in class method definitions And also nested functions. This is a PEP 8 convention, E211.	2020-05-16 20:19:42 +01:00
Chris Mayo	1663e10fe7	Remove spaces after names in function definitions This is a PEP 8 convention, E211.	2020-05-16 20:19:42 +01:00
Chris Mayo	fc11d08968	Remove spaces after names in class definitions	2020-05-16 20:19:42 +01:00
Chris Mayo	f8c9faec1b	Remove Python 2 cStringIO imports	2020-05-15 19:37:04 +01:00
Chris Mayo	bda9612273	Make html.escape Python 3 only	2020-05-14 20:15:28 +01:00
Chris Mayo	42de609f8e	Make urllib imports Python 3 only	2020-05-14 20:15:28 +01:00
Chris Mayo	3c661a83d0	Replace parse_host_port() in checker.proxysupport with url.splitport()	2020-05-14 20:15:28 +01:00
Chris Mayo	736c893707	Merge pull request #377 from cjmayo/tidyten3 Remove u string prefixes	2020-05-13 19:36:54 +01:00
Chris Mayo	44e81d27dd	Remove inheriting object All Python 3 classes are new-style.	2020-05-08 10:45:31 +01:00
Chris Mayo	b0ea72e8c1	Remove # -*- coding: lines Except for tests that include non-unicode characters: tests/test_po.py tests/test_strformat.py tests/test_url.py tests/checker/test_error.py tests/checker/test_news.py	2020-05-08 10:45:31 +01:00
Chris Mayo	4d3e5abcfa	Remove u string prefixes	2020-04-30 20:11:59 +01:00
Chris Mayo	4ffdbf2406	Replace MetaRobotsFinder using BeautifulSoup.find()	2020-04-29 20:07:00 +01:00
anarcat	183d483074	Merge pull request #365 from cjmayo/tidyten1 Remove use of the future package	2020-04-26 12:02:30 -04:00
Chris Mayo	ee6628a831	Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py Remove one subpackage and some import lines where htmlutil.linkparse is also being used.	2020-04-18 20:30:45 +01:00
Chris Mayo	f5e7f3a382	Remove use of the future package It was providing Python 2 compatibility.	2020-04-15 19:49:16 +01:00
Chris Mayo	0795e3c1b4	Replace Parser class using BeautifulSoup.find_all()	2020-04-10 13:51:09 +01:00
Chris Mayo	02e1c389b2	Remove parser flush() and reset() Remnants of the feed() interface.	2020-04-08 20:03:35 +01:00
Chris Mayo	40f43ae41c	Create one function to make soup objects	2020-04-08 20:03:35 +01:00
Chris Mayo	9d8d251d06	Replace Parser lineno() and column() methods Stop storing this data in Parser object state.	2020-04-08 20:03:35 +01:00
Chris Mayo	3ff3d72492	Use BeautifulSoup element attrs directly	2020-04-03 19:24:08 +01:00
Chris Mayo	28701e291a	Remove use of Python 2 unicode() and related u prefixes Several instances for MS Windows left unchanged.	2020-04-01 19:39:50 +01:00
Chris Mayo	5b66964afa	Remove unused .charset from checker classes Unused since: `4f8c2954` ("Don't set parser.encoding", 2019-10-05)	2020-03-30 19:32:30 +01:00
Wes Haggard	dcdc64e878	Turn status code 429 into warning instead of failure	2020-03-25 16:36:08 -07:00
Chris Mayo	5eaad24641	Use HTTP header encoding for decoding	2020-03-22 19:54:37 +00:00
Marius Gedminas	58b0d5aaae	Fix TypeError: string arg required in content_allows_robots() See #323 and #317.	2019-10-22 14:13:45 +03:00
anarcat	f73ba54a2a	Merge pull request #308 from cjmayo/decode Decode content when retrieved	2019-10-10 09:46:32 -04:00
Chris Mayo	a9f147c347	Update fileutil.pathencode() because paths are now strings	2019-10-05 19:38:57 +01:00
Chris Mayo	646e138166	Pass encoding when unquoting Else non-UTF-8 codes are misinterpreted: >>> from urllib import parse >>> parse.unquote("%FF") '�' >>> parse.unquote("%FF", "latin1") 'ÿ'	2019-10-05 19:38:57 +01:00
Chris Mayo	153e53ba03	Reuse soup object used for detecting encoding in the HTML parser	2019-10-05 19:38:57 +01:00
Chris Mayo	607328d5c5	Support Beautiful Soup line numbers	2019-10-05 19:38:57 +01:00
Chris Mayo	4f8c2954cf	Don't set parser.encoding Read-only property with new Beautiful Soup parser.	2019-10-05 19:38:57 +01:00
Chris Mayo	5732606c58	Remove urlutil.decode_for_unquote() Not needed since all content is now being decoded on retrieval. Added by: `a6643034` ("Python3: decode parts before submitting them to urllib.quote()", 2018-01-05)	2019-10-04 19:37:09 +01:00
Chris Mayo	2776eb5f52	Revert "Python3: fix opening file URLs" This reverts commit `4c9ec511b5`.	2019-10-04 19:37:09 +01:00
Chris Mayo	5fc01455b7	Decode content when retrieved, use bs4 to detect encoding if non-Unicode UrlBase has been modified as follows: - the "data" variable now holds bytes - decoded content is stored in a new variable "text" - functionality from get_content() has been split out into get_raw_content() which returns "data" and download_content() which calls read_content() and sets the download related variables. This allows for subclasses to do their own decoding and parsers to use bytes.	2019-09-30 19:46:24 +01:00
Chris Mayo	53cd9475b5	Replace deprecated cgi.escape html provided for Python 2 by future https://python-future.org/compatible_idioms.html#html-escaping-and-entities	2019-09-17 20:25:05 +01:00
anarcat	2c7573b3b8	Merge pull request #300 from cjmayo/python3_43 {python3_43} Python3: fix for test_telnet in urlbase.py	2019-09-16 10:08:18 -04:00
anarcat	bec68f237b	Merge pull request #299 from cjmayo/python3_42 {python3_42} fixes for Python 3: fix telneturl	2019-09-16 10:07:55 -04:00
anarcat	27d672c78b	Merge pull request #297 from cjmayo/python3_40 {python3_40} Python3: fixes form checker/__init__.py	2019-09-16 10:06:05 -04:00
Petr Dlouhý	c2af88ad2e	Python3: fix for test_telnet in urlbase.py	2019-09-15 19:49:26 +01:00
Petr Dlouhý	a2e67af7b4	fixes for Python 3: fix telneturl	2019-09-15 19:49:18 +01:00
Petr Dlouhý	bb542b00e9	Python3: fixes form checker/__init__.py	2019-09-15 19:49:00 +01:00
Chris Mayo	06fdd78f91	Python3: fix TypeError in HttpUrl.read_content() From test_http_redirect: File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content line: buf.write(data) locals: buf = <local> <_io.StringIO object at 0x7f8fe2f45e10> buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10> data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n' TypeError: string argument expected, got 'bytes'	2019-09-15 19:42:29 +01:00
Chris Mayo	4c9ec511b5	Python3: fix opening file URLs urllib.request.urlopen() expects a string or Request object.	2019-09-12 19:58:27 +01:00
anarcat	2239458966	Merge pull request #285 from cjmayo/python3_34 {python3_34} fixes for Python 3: fix test_misc	2019-09-11 09:48:14 -04:00
anarcat	492058a360	Merge pull request #281 from cjmayo/python3_30 {python3_30} Python3: fix decoding strings	2019-09-11 09:47:10 -04:00
Petr Dlouhý	f272206110	Python3: fix decoding strings	2019-09-10 19:52:23 +01:00
Petr Dlouhý	e10f25b968	fixes for Python 3: fix running problems in Python 3	2019-09-10 19:30:09 +01:00
Petr Dlouhý	129a68da38	fixes for Python 3: fix test_misc	2019-09-09 19:51:30 +01:00
Petr Dlouhý	ffb0a68ff7	Python3: fix fileurl	2019-09-05 19:41:53 +01:00
anarcat	59fe9ed876	Merge pull request #228 from cjmayo/python3_18 {python3_18} Python3: fix unicode in urlbase	2019-04-25 16:17:00 -04:00
anarcat	70f0bbf225	Merge pull request #250 from cjmayo/ftpserver Get FtpServerTest working by updating to current pyftpdlib API	2019-04-25 16:16:33 -04:00
Petr Dlouhý	e92b0a9f7b	Python3: fix unicode in urlbase	2019-04-25 19:57:45 +01:00
Petr Dlouhý	b3881ce3b5	Python3: fix urlbase, strformat and others	2019-04-25 19:57:45 +01:00
anarcat	bb0a1e1992	Merge pull request #242 from cjmayo/wummel Update references to GitHub project from wummel to linkchecker	2019-04-24 10:58:15 -04:00
anarcat	ee8667e1ca	Merge pull request #229 from cjmayo/python3_19 {python3_19} Python3: fix unicode in fileurl	2019-04-24 10:57:45 -04:00
Chris Mayo	f60810b050	Fix Python 3 "TypeError: decoding str is not supported" in FtpUrl.cwd	2019-04-22 19:34:46 +01:00
Petr Dlouhý	b40f4722c7	Python3: fix unicode in fileurl	2019-04-19 20:42:38 +01:00
EsuS	004632a99b	Update references to GitHub project from wummel to linkchecker Remove all mention of donations.	2019-04-18 19:59:52 +01:00
Petr Dlouhý	bc99dc51de	Python3: fix HtmlParser	2019-04-18 19:35:16 +01:00
Petr Dlouhý	4acabf5cb5	fix urllib imports	2019-04-09 20:09:35 +01:00
gerdneuman	de6a82b378	Added whatsapp:// to ignored protocols Fixes https://github.com/wummel/linkchecker/issues/595	2018-08-09 13:49:15 +02:00
Petr Dlouhý	256202a20b	fixes for Python 3: fix proxysupport	2018-01-19 09:52:43 +01:00
Antoine Beaupré	9b12b5d66f	workaround new limitation in requests newer requests do not expose the internal SSL socket object so we cannot verify certificates. there was work to allow custom verification routines which we could use, but this never finished: https://github.com/shazow/urllib3/pull/257 so right now, just treat missing socket information as if the cert was missing. Closes: #76	2017-10-02 20:19:25 -04:00
Graham Seaman	233e7dcf68	Allow wayback-format urls without affecting atom 'feed' urls	2017-02-09 11:43:45 +00:00
Antoine Beaupré	9d899d1dfa	add --no-robots commandline flag While this flag can be abused, it seems to me like a legitimate use case that you want to check a fairly small document for mistakes, which includes references to a website which has a robots.txt that denies all robots. It turns out that most websites do not add a permission for LinkCheck to use their site, and some sites, like the Debian BTS for example, are very hostile with bots in general. Between me using linkcheck and me using my web browser to check those links one by one, there is not a big difference. In fact, using linkcheck may be better for the website because it will use HEAD requests instead of a GET, and will not fetch all page elements (javascript, images, etc) which can often be fairly big. Besides, hostile users will patch the software themselves: it took me only a few minutes to disable the check, and a few more to make that into a proper patch. By forcing robots.txt without any other option, we are hurting our good users and not keeping hostile users from doing harm. The patch is still incomplete, but works. It lacks: documentation and unit tests. Closes: #508	2016-05-19 14:43:59 -04:00
Bastian Kleineidam	92c4ca9a5e	Debug request headers	2014-09-20 12:16:24 +02:00
Bastian Kleineidam	029c20ed98	More python3 fixes	2014-09-12 21:59:07 +02:00
Bastian Kleineidam	35eb30432e	Added some Python3 fixes.	2014-09-12 19:36:30 +02:00
Bastian Kleineidam	06c6b80ed3	Fix proxy support.	2014-09-05 22:48:10 +02:00
Bastian Kleineidam	ee4545399d	Support itms-services: URLs. #532	2014-09-05 21:06:10 +02:00
Bastian Kleineidam	c684918ba6	Ignore urllib3 warnings about invalid SSL certs since we check them ourselves.	2014-09-05 20:00:00 +02:00
Bastian Kleineidam	2354f16dbb	Catch urllib3 errors.	2014-09-05 19:59:28 +02:00
Bastian Kleineidam	a665d35feb	Use proxies and checker session in robots.txt.	2014-07-14 20:28:28 +02:00
Bastian Kleineidam	266e9e189f	Further code cleanup.	2014-07-14 20:14:00 +02:00
Bastian Kleineidam	7838521b6e	Code cleanup.	2014-07-14 19:49:01 +02:00
Bastian Kleineidam	eafa1ed2da	Updated unknown URL schemes.	2014-07-13 21:51:53 +02:00
Bastian Kleineidam	0fa7ed2699	Fix empty URL handling.	2014-07-03 23:34:40 +02:00
Bastian Kleineidam	1590ab6240	cleanup	2014-07-01 21:12:47 +02:00
Bastian Kleineidam	9a124513e3	Merge branch 'master' of github.com:wummel/linkchecker	2014-07-01 21:11:33 +02:00
wummel	9bb3852edf	Merge pull request #515 from Mark-Hetherington/extern-redirect When following redirections update url.extern	2014-07-01 21:11:13 +02:00
Bastian Kleineidam	12cc12db53	Add get_redirects() function.	2014-07-01 21:11:06 +02:00
Bastian Kleineidam	cde261c009	Parse Refresh: and Content-Location: header values for URLs.	2014-07-01 20:16:43 +02:00
Bastian Kleineidam	c3ec91ac6d	Fix intern URL search pattern.	2014-06-13 23:52:21 +02:00
Bastian Kleineidam	ad8eb424f3	Merge Mark-Hetherington-xml-parse-warn with slight modifications.	2014-06-13 20:50:37 +02:00
Mark Hetherington	34d83db29c	When following redirections update url.extern	2014-05-19 14:59:58 +10:00
Bastian Kleineidam	eaa8a963ec	Refactor logging configuration.	2014-05-10 21:23:06 +02:00
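
Commit 646e138166 in the list above notes that non-UTF-8 percent-escapes are misinterpreted unless an encoding is passed when unquoting. A minimal sketch of that behaviour, using only the standard library's urllib.parse.unquote; the variable names are illustrative and are not taken from the linkchecker source:

```python
from urllib.parse import unquote

# Illustrative input: a percent-escape for the single byte 0xFF,
# which is not valid UTF-8 on its own.
quoted = "%FF"

# Without an explicit encoding, unquote() decodes as UTF-8 with
# errors="replace", so the invalid byte becomes U+FFFD.
default_result = unquote(quoted)

# Passing the encoding the data was actually written in recovers
# the intended character (U+00FF in Latin-1).
latin1_result = unquote(quoted, encoding="latin1")

print(repr(default_result), repr(latin1_result))
```

Which encoding to pass depends on where the URL came from; the sketch hard-codes Latin-1 only to mirror the example in the commit message.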

938 commits