Chris Mayo
beaf9399f8
Elevate redirection to a warning tagged http-redirected
...
Include the HTTP status code and reason in the message.
2023-08-28 19:22:24 +01:00
Chris Mayo
e6da68b7f6
Add linting with Pylint to build workflow
2023-05-03 19:24:53 +01:00
Chris Mayo
8065c75c4e
Convert some printf-style strings
2022-11-08 19:21:29 +00:00
Chris Mayo
b6eea83f63
Merge pull request #676 from cjmayo/robotmap
...
Document sitemaps in linkchecker(1)
2022-10-17 19:25:57 +01:00
Chris Mayo
689557d9af
Add logging of MIME types and improve docstrings
2022-10-17 19:21:03 +01:00
Chris Mayo
e88cf49c8f
Enable average HTTP request rate to be above 4 per second
2022-10-05 19:28:01 +01:00
Chris Mayo
52b9881820
Separate URL encoding and content encoding
...
Ensure users of url_data.encoding are using the URL encoding.
Combined since:
5fc01455 ("Decode content when retrieved, use bs4 to detect encoding if non-Unicode", 2019-09-30)
2022-09-29 19:21:11 +01:00
Chris Mayo
a0b28cc0ff
Rename url-rate-limited to http-rate-limited
...
Make consistent with the other warnings:
- The first part of the name represents the checker class in which the
warning is raised
- Update initial comment
2022-09-06 19:32:24 +01:00
Chris Mayo
d6936ceb91
Add warning url-content-type-unparseable
2022-09-02 19:29:11 +01:00
Chris Mayo
4444a87eb9
Update Requests bug link
2021-12-15 19:34:24 +00:00
Chris Mayo
fe5a34c68f
Remove linkcheck.checker.proxysupport
...
Set up the requests.Session() with the complete proxy configuration
to fix a problem with using an HTTP server as an HTTPS proxy and
potential redirection issues.
Requests handles no_proxy.
2021-12-13 19:25:23 +00:00
Chris Mayo
a04214465a
Update HttpUrl.encoding after following redirects
2021-12-06 19:34:31 +00:00
Chris Mayo
0325ecd73f
Remove httpurl.HEADER_ENCODING
...
Unused since:
d91a32822 ("Remove strformat.unicode_safe() and strformat.url_unicode_split()", 2020-07-07)
2021-12-06 19:34:31 +00:00
Chris Mayo
c89c617a58
Ignore an encoding of ISO-8859-1 returned by Requests
...
ISO-8859-1 is a fallback for Requests and causes us to mangle UTF-8
content.
Requests' utils.py:
def get_encoding_from_headers(headers):
"""Returns encodings from given HTTP Header Dict.
:param headers: dictionary to extract encoding from.
:rtype: str
"""
content_type = headers.get('content-type')
if not content_type:
return None
content_type, params = _parse_content_type_header(content_type)
if 'charset' in params:
return params['charset'].strip("'\"")
if 'text' in content_type:
return 'ISO-8859-1'
if 'application/json' in content_type:
# Assume UTF-8 based on RFC 4627: https://www.ietf.org/rfc/rfc4627.txt since the charset was unset
return 'utf-8'
2021-11-29 19:52:37 +00:00
Chris Mayo
8779c39735
Replace deprecated urllib.parse.split functions
2020-08-22 16:28:53 +01:00
Chris Mayo
1b497389b5
Merge pull request #483 from cjmayo/retryafter
...
Don't translate "Retry-After" server header field
2020-08-21 16:51:17 +01:00
Chris Mayo
7ee151ebbf
Don't translate "Retry-After" server header field
...
It is defined in RFC 7231.
2020-08-14 19:29:19 +01:00
Chris Mayo
dee21ee9a0
Fix formatting and typos in docstrings
2020-07-25 16:35:48 +01:00
Chris Mayo
53bd5c4d21
Remove HttpUrl.getheader()
2020-07-07 17:25:28 +01:00
Chris Mayo
3fcee872b6
urlparts need to support assignment
2020-07-07 17:25:28 +01:00
Chris Mayo
d91a328224
Remove strformat.unicode_safe() and strformat.url_unicode_split()
...
All strings support Unicode in Python 3.
2020-07-07 17:25:28 +01:00
Chris Mayo
a6b1eb45b1
Convert to Python 3 super()
2020-06-03 20:06:36 +01:00
Chris Mayo
b9f4864d9e
Remove unnecessary commas before closing brackets in linkcheck/
2020-05-30 17:01:36 +01:00
Chris Mayo
a92a684ac4
Run black on linkcheck/
2020-05-30 17:01:36 +01:00
Chris Mayo
97f50e8be1
Remove unused import htmlsoup from checker/httpurl.py
...
Unused since:
f7337f55 ("Fix error due to an empty html file accessed over http", 2020-05-23)
2020-05-25 19:50:57 +01:00
Marius Gedminas
d0169c46d4
Merge pull request #348 from weshaggard/HandleRateLimiting
...
Turn status code 429 into warning instead of failure
2020-05-24 16:16:56 +03:00
Marius Gedminas
dcafa2df75
Avoid u-prefixed strings
...
linkchecker is Python 3 only, all strings are unicode.
2020-05-24 14:50:07 +03:00
Chris Mayo
03b1c4919d
Record encoding in debug log messages
2020-05-23 20:01:24 +01:00
Chris Mayo
f7337f55e8
Fix error due to an empty html file accessed over http
...
Use the already fixed [1] UrlBase.get_content() in HttpUrl.
[1] 5bd1fb4 ("Fix internal error on empty HTML files", 2020-05-21)
2020-05-23 20:01:24 +01:00
Marius Gedminas
f268a90cfb
Merge branch 'master' into HandleRateLimiting
2020-05-23 14:15:52 +03:00
Marius Gedminas
4f3fe5e1c3
Make sure fetching robots.txt uses the configured timeout
...
Closes #396 .
2020-05-22 10:53:33 +03:00
Chris Mayo
a15a2833ca
Remove spaces after names in class method definitions
...
And also nested functions.
This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
fc11d08968
Remove spaces after names in class definitions
2020-05-16 20:19:42 +01:00
Chris Mayo
736c893707
Merge pull request #377 from cjmayo/tidyten3
...
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
b0ea72e8c1
Remove # -*- coding: lines
...
Except for tests that include non-unicode characters:
tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa
Remove u string prefixes
2020-04-30 20:11:59 +01:00
Chris Mayo
4ffdbf2406
Replace MetaRobotsFinder using BeautifulSoup.find()
2020-04-29 20:07:00 +01:00
Chris Mayo
ee6628a831
Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
...
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
0795e3c1b4
Replace Parser class using BeautifulSoup.find_all()
2020-04-10 13:51:09 +01:00
Chris Mayo
02e1c389b2
Remove parser flush() and reset()
...
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
40f43ae41c
Create one function to make soup objects
2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06
Replace Parser lineno() and column() methods
...
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
3ff3d72492
Use BeautifulSoup element attrs directly
2020-04-03 19:24:08 +01:00
Chris Mayo
5b66964afa
Remove unused .charset from checker classes
...
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Wes Haggard
dcdc64e878
Turn status code 429 into warning instead of failure
2020-03-25 16:36:08 -07:00
Chris Mayo
5eaad24641
Use HTTP header encoding for decoding
2020-03-22 19:54:37 +00:00
Marius Gedminas
58b0d5aaae
Fix TypeError: string arg required in content_allows_robots()
...
See #323 an #317 .
2019-10-22 14:13:45 +03:00
Chris Mayo
153e53ba03
Reuse soup object used for detecting encoding in the HTML parser
2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf
Don't set parser.encoding
...
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Chris Mayo
06fdd78f91
Python3: fix TypeError in HttpUrl.read_content()
...
From test_http_redirect:
File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content
line: buf.write(data)
locals:
buf = <local> <_io.StringIO object at 0x7f8fe2f45e10>
buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10>
data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n'
TypeError: string argument expected, got 'bytes'
2019-09-15 19:42:29 +01:00