Chris Mayo
dee21ee9a0
Fix formatting and typos in docstrings
2020-07-25 16:35:48 +01:00
Chris Mayo
53bd5c4d21
Remove HttpUrl.getheader()
2020-07-07 17:25:28 +01:00
Chris Mayo
3fcee872b6
urlparts need to support assignment
2020-07-07 17:25:28 +01:00
Chris Mayo
d91a328224
Remove strformat.unicode_safe() and strformat.url_unicode_split()
...
All strings support Unicode in Python 3.
2020-07-07 17:25:28 +01:00
Chris Mayo
a6b1eb45b1
Convert to Python 3 super()
2020-06-03 20:06:36 +01:00
Chris Mayo
b9f4864d9e
Remove unnecessary commas before closing brackets in linkcheck/
2020-05-30 17:01:36 +01:00
Chris Mayo
a92a684ac4
Run black on linkcheck/
2020-05-30 17:01:36 +01:00
Chris Mayo
97f50e8be1
Remove unused import htmlsoup from checker/httpurl.py
...
Unused since:
f7337f55 ("Fix error due to an empty html file accessed over http", 2020-05-23)
2020-05-25 19:50:57 +01:00
Marius Gedminas
d0169c46d4
Merge pull request #348 from weshaggard/HandleRateLimiting
...
Turn status code 429 into warning instead of failure
2020-05-24 16:16:56 +03:00
Marius Gedminas
dcafa2df75
Avoid u-prefixed strings
...
linkchecker is Python 3 only, all strings are unicode.
2020-05-24 14:50:07 +03:00
Chris Mayo
03b1c4919d
Record encoding in debug log messages
2020-05-23 20:01:24 +01:00
Chris Mayo
f7337f55e8
Fix error due to an empty html file accessed over http
...
Use the already fixed [1] UrlBase.get_content() in HttpUrl.
[1] 5bd1fb4 ("Fix internal error on empty HTML files", 2020-05-21)
2020-05-23 20:01:24 +01:00
Marius Gedminas
f268a90cfb
Merge branch 'master' into HandleRateLimiting
2020-05-23 14:15:52 +03:00
Marius Gedminas
4f3fe5e1c3
Make sure fetching robots.txt uses the configured timeout
...
Closes #396.
2020-05-22 10:53:33 +03:00
Chris Mayo
a15a2833ca
Remove spaces after names in class method definitions
...
And also nested functions.
This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
fc11d08968
Remove spaces after names in class definitions
2020-05-16 20:19:42 +01:00
Chris Mayo
736c893707
Merge pull request #377 from cjmayo/tidyten3
...
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
b0ea72e8c1
Remove # -*- coding: lines
...
Except for tests that include non-unicode characters:
tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa
Remove u string prefixes
2020-04-30 20:11:59 +01:00
Chris Mayo
4ffdbf2406
Replace MetaRobotsFinder using BeautifulSoup.find()
2020-04-29 20:07:00 +01:00
Chris Mayo
ee6628a831
Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
...
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
0795e3c1b4
Replace Parser class using BeautifulSoup.find_all()
2020-04-10 13:51:09 +01:00
Chris Mayo
02e1c389b2
Remove parser flush() and reset()
...
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
40f43ae41c
Create one function to make soup objects
2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06
Replace Parser lineno() and column() methods
...
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
3ff3d72492
Use BeautifulSoup element attrs directly
2020-04-03 19:24:08 +01:00
Chris Mayo
5b66964afa
Remove unused .charset from checker classes
...
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Wes Haggard
dcdc64e878
Turn status code 429 into warning instead of failure
2020-03-25 16:36:08 -07:00
Chris Mayo
5eaad24641
Use HTTP header encoding for decoding
2020-03-22 19:54:37 +00:00
Marius Gedminas
58b0d5aaae
Fix TypeError: string arg required in content_allows_robots()
...
See #323 and #317.
2019-10-22 14:13:45 +03:00
Chris Mayo
153e53ba03
Reuse soup object used for detecting encoding in the HTML parser
2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf
Don't set parser.encoding
...
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Chris Mayo
06fdd78f91
Python3: fix TypeError in HttpUrl.read_content()
...
From test_http_redirect:
File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content
line: buf.write(data)
locals:
buf = <local> <_io.StringIO object at 0x7f8fe2f45e10>
buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10>
data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n'
TypeError: string argument expected, got 'bytes'
2019-09-15 19:42:29 +01:00
EsuS
004632a99b
Update references to GitHub project from wummel to linkchecker
...
Remove all mention of donations.
2019-04-18 19:59:52 +01:00
Antoine Beaupré
9b12b5d66f
Work around new limitation in requests
...
Newer versions of requests do not expose the internal SSL socket object, so we
cannot verify certificates. There was work to allow custom
verification routines which we could use, but this was never finished:
https://github.com/shazow/urllib3/pull/257
So right now, just treat missing socket information as if the cert was
missing.
Closes: #76
2017-10-02 20:19:25 -04:00
Antoine Beaupré
9d899d1dfa
add --no-robots commandline flag
...
While this flag can be abused, it seems to me like a legitimate use
case that you want to check a fairly small document for mistakes,
which includes references to a website which has a robots.txt that
denies all robots. It turns out that most websites do *not* add a
permission for LinkCheck to use their site, and some sites, like the
Debian BTS for example, are very hostile to bots in general.
Between me using linkcheck and me using my web browser to check those
links one by one, there is not a big difference. In fact, using
linkcheck may be *better* for the website because it will use HEAD
requests instead of a GET, and will not fetch all page elements
(javascript, images, etc) which can often be fairly big.
Besides, hostile users will patch the software themselves: it took me
only a few minutes to disable the check, and a few more to make that
into a proper patch.
By forcing robots.txt without any other option, we are hurting our
good users and not keeping hostile users from doing harm.
The patch is still incomplete, but works. It lacks: documentation and
unit tests.
Closes: #508
2016-05-19 14:43:59 -04:00
Bastian Kleineidam
92c4ca9a5e
Debug request headers
2014-09-20 12:16:24 +02:00
Bastian Kleineidam
029c20ed98
More python3 fixes
2014-09-12 21:59:07 +02:00
Bastian Kleineidam
c684918ba6
Ignore urllib3 warnings about invalid SSL certs since we check them ourselves.
2014-09-05 20:00:00 +02:00
Bastian Kleineidam
a665d35feb
Use proxies and checker session in robots.txt.
2014-07-14 20:28:28 +02:00
Bastian Kleineidam
1590ab6240
cleanup
2014-07-01 21:12:47 +02:00
Bastian Kleineidam
9a124513e3
Merge branch 'master' of github.com:wummel/linkchecker
2014-07-01 21:11:33 +02:00
wummel
9bb3852edf
Merge pull request #515 from Mark-Hetherington/extern-redirect
...
When following redirections update url.extern
2014-07-01 21:11:13 +02:00
Bastian Kleineidam
12cc12db53
Add get_redirects() function.
2014-07-01 21:11:06 +02:00
Bastian Kleineidam
cde261c009
Parse Refresh: and Content-Location: header values for URLs.
2014-07-01 20:16:43 +02:00
Mark Hetherington
34d83db29c
When following redirections update url.extern
2014-05-19 14:59:58 +10:00
Bastian Kleineidam
eaa8a963ec
Refactor logging configuration.
2014-05-10 21:23:06 +02:00
Bastian Kleineidam
6bae3e0f49
Use the same request arguments for redirects.
2014-04-23 22:03:44 +02:00
Bastian Kleineidam
6caf654031
Parse Link: headers.
2014-04-10 17:50:55 +02:00
Bastian Kleineidam
fa26876f67
Don't use encoding detection since it's very slow.
2014-03-27 12:27:11 +01:00