Commit graph

332 commits

Author SHA1 Message Date
Chris Mayo
8065c75c4e Convert some printf-style strings 2022-11-08 19:21:29 +00:00
Chris Mayo
b6eea83f63
Merge pull request #676 from cjmayo/robotmap
Document sitemaps in linkchecker(1)
2022-10-17 19:25:57 +01:00
Chris Mayo
689557d9af Add logging of MIME types and improve docstrings 2022-10-17 19:21:03 +01:00
Chris Mayo
e88cf49c8f Enable average HTTP request rate to be above 4 per second 2022-10-05 19:28:01 +01:00
Chris Mayo
52b9881820 Separate URL encoding and content encoding
Ensure users of url_data.encoding are using the URL encoding.

Combined since:
5fc01455 ("Decode content when retrieved, use bs4 to detect encoding if non-Unicode", 2019-09-30)
2022-09-29 19:21:11 +01:00
Chris Mayo
a0b28cc0ff Rename url-rate-limited to http-rate-limited
Make consistent with the other warnings:

- The first part of the name represents the checker class in which the
  warning is raised

- Update initial comment
2022-09-06 19:32:24 +01:00
Chris Mayo
d6936ceb91 Add warning url-content-type-unparseable 2022-09-02 19:29:11 +01:00
Chris Mayo
4444a87eb9 Update Requests bug link 2021-12-15 19:34:24 +00:00
Chris Mayo
fe5a34c68f Remove linkcheck.checker.proxysupport
Set up the requests.Session() with the complete proxy configuration
to fix a problem with using an HTTP server as an HTTPS proxy and
potential redirection issues.

Requests handles no_proxy.
2021-12-13 19:25:23 +00:00
Chris Mayo
a04214465a Update HttpUrl.encoding after following redirects 2021-12-06 19:34:31 +00:00
Chris Mayo
0325ecd73f Remove httpurl.HEADER_ENCODING
Unused since:
d91a32822 ("Remove strformat.unicode_safe() and strformat.url_unicode_split()", 2020-07-07)
2021-12-06 19:34:31 +00:00
Chris Mayo
c89c617a58 Ignore an encoding of ISO-8859-1 returned by Requests
ISO-8859-1 is a fallback for Requests and causes us to mangle UTF-8
content.

Requests' utils.py:

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = _parse_content_type_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

    if 'application/json' in content_type:
        # Assume UTF-8 based on RFC 4627: https://www.ietf.org/rfc/rfc4627.txt since the charset was unset
        return 'utf-8'
2021-11-29 19:52:37 +00:00
Chris Mayo
8779c39735 Replace deprecated urllib.parse.split functions 2020-08-22 16:28:53 +01:00
Chris Mayo
1b497389b5
Merge pull request #483 from cjmayo/retryafter
Don't translate "Retry-After" server header field
2020-08-21 16:51:17 +01:00
Chris Mayo
7ee151ebbf Don't translate "Retry-After" server header field
It is defined in RFC 7231.
2020-08-14 19:29:19 +01:00
Chris Mayo
dee21ee9a0 Fix formatting and typos in docstrings 2020-07-25 16:35:48 +01:00
Chris Mayo
53bd5c4d21 Remove HttpUrl.getheader() 2020-07-07 17:25:28 +01:00
Chris Mayo
3fcee872b6 urlparts need to support assignment 2020-07-07 17:25:28 +01:00
Chris Mayo
d91a328224 Remove strformat.unicode_safe() and strformat.url_unicode_split()
All strings support Unicode in Python 3.
2020-07-07 17:25:28 +01:00
Chris Mayo
a6b1eb45b1 Convert to Python 3 super() 2020-06-03 20:06:36 +01:00
Chris Mayo
b9f4864d9e Remove unnecessary commas before closing brackets in linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
a92a684ac4 Run black on linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
97f50e8be1 Remove unused import htmlsoup from checker/httpurl.py
Unused since:

f7337f55 ("Fix error due to an empty html file accessed over http", 2020-05-23)
2020-05-25 19:50:57 +01:00
Marius Gedminas
d0169c46d4
Merge pull request #348 from weshaggard/HandleRateLimiting
Turn status code 429 into warning instead of failure
2020-05-24 16:16:56 +03:00
Marius Gedminas
dcafa2df75
Avoid u-prefixed strings
linkchecker is Python 3 only, all strings are unicode.
2020-05-24 14:50:07 +03:00
Chris Mayo
03b1c4919d Record encoding in debug log messages 2020-05-23 20:01:24 +01:00
Chris Mayo
f7337f55e8 Fix error due to an empty html file accessed over http
Use the already fixed [1] UrlBase.get_content() in HttpUrl.

[1] 5bd1fb4 ("Fix internal error on empty HTML files", 2020-05-21)
2020-05-23 20:01:24 +01:00
Marius Gedminas
f268a90cfb
Merge branch 'master' into HandleRateLimiting 2020-05-23 14:15:52 +03:00
Marius Gedminas
4f3fe5e1c3 Make sure fetching robots.txt uses the configured timeout
Closes #396.
2020-05-22 10:53:33 +03:00
Chris Mayo
a15a2833ca Remove spaces after names in class method definitions
And also nested functions.

This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
fc11d08968 Remove spaces after names in class definitions 2020-05-16 20:19:42 +01:00
Chris Mayo
736c893707
Merge pull request #377 from cjmayo/tidyten3
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-unicode characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa Remove u string prefixes 2020-04-30 20:11:59 +01:00
Chris Mayo
4ffdbf2406 Replace MetaRobotsFinder using BeautifulSoup.find() 2020-04-29 20:07:00 +01:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
0795e3c1b4 Replace Parser class using BeautifulSoup.find_all() 2020-04-10 13:51:09 +01:00
Chris Mayo
02e1c389b2 Remove parser flush() and reset()
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
40f43ae41c Create one function to make soup objects 2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06 Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
3ff3d72492 Use BeautifulSoup element attrs directly 2020-04-03 19:24:08 +01:00
Chris Mayo
5b66964afa Remove unused .charset from checker classes
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Wes Haggard
dcdc64e878 Turn status code 429 into warning instead of failure 2020-03-25 16:36:08 -07:00
Chris Mayo
5eaad24641 Use HTTP header encoding for decoding 2020-03-22 19:54:37 +00:00
Marius Gedminas
58b0d5aaae Fix TypeError: string arg required in content_allows_robots()
See #323 an #317.
2019-10-22 14:13:45 +03:00
Chris Mayo
153e53ba03 Reuse soup object used for detecting encoding in the HTML parser 2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf Don't set parser.encoding
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Chris Mayo
06fdd78f91 Python3: fix TypeError in HttpUrl.read_content()
From test_http_redirect:

  File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content
    line: buf.write(data)
    locals:
      buf = <local> <_io.StringIO object at 0x7f8fe2f45e10>
      buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10>
      data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n'
TypeError: string argument expected, got 'bytes'
2019-09-15 19:42:29 +01:00
EsuS
004632a99b Update references to GitHub project from wummel to linkchecker
Remove all mention of donations.
2019-04-18 19:59:52 +01:00
Antoine Beaupré
9b12b5d66f
workaround new limitation in requests
newer requests do not expose the internal SSL socket object so we
cannot verify certificates. there was work to allow custom
verification routines which we could use, but this never finished:

https://github.com/shazow/urllib3/pull/257

so right now, just treat missing socket information as if the cert was
missing.

Closes: #76
2017-10-02 20:19:25 -04:00