Commit graph

938 commits

Author SHA1 Message Date
Chris Mayo
52b9881820 Separate URL encoding and content encoding
Ensure users of url_data.encoding are using the URL encoding.

Combined since:
5fc01455 ("Decode content when retrieved, use bs4 to detect encoding if non-Unicode", 2019-09-30)
2022-09-29 19:21:11 +01:00
Chris Mayo
61071fc5dc
Merge pull request #668 from cjmayo/defaults
Clarify default values in initial linkcheckerrc and elsewhere
2022-09-28 19:36:44 +01:00
Lukas Pirl
8c959589c3
add option to ignore specific errors for specific URLs 2022-09-25 22:52:04 +02:00
Chris Mayo
130347f223 Remove unused WARN_IGNORE_URL
URL ignored was changed to an info message in:
7b34be59 ("Introduce check plugins, use Python requests for http/s
connections, and some code cleanups and improvements.", 2014-03-01)
2022-09-22 19:24:55 +01:00
Chris Mayo
ed8e17137c Add gemini scheme 2022-09-16 19:21:32 +01:00
Chris Mayo
25ce4b854c Update IANA schemes 2022-09-16 19:21:32 +01:00
Chris Mayo
a0b28cc0ff Rename url-rate-limited to http-rate-limited
Make consistent with the other warnings:

- The first part of the name represents the checker class in which the
  warning is raised

- Update initial comment
2022-09-06 19:32:24 +01:00
Chris Mayo
3c7fb5b571 Fix checking directory containing Unicode filenames
Non-Unicode filenames are not supported.

sys.platform has not returned "linux2" since Python 3.3.
2022-09-05 19:28:40 +01:00
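The platform note in 3c7fb5b571 can be illustrated with a minimal sketch. The helper name is_linux is hypothetical and not taken from LinkChecker; it only shows a check that does not depend on the old "linux2" value.

```python
import sys


def is_linux(platform: str = sys.platform) -> bool:
    """Return True for any Linux value of sys.platform (hypothetical helper)."""
    # Python 3.3+ always reports "linux"; older interpreters could report
    # "linux2" or "linux3", so a prefix test covers both cases.
    return platform.startswith("linux")
```

Comparing with == "linux2" would never match on Python 3.3 or later, which is why a prefix test (or a plain == "linux") is the safer form.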
Chris Mayo
c79bc07cee Add MIME type application/vnd.adobe.flash.movie 2022-09-02 19:29:11 +01:00
Chris Mayo
d6936ceb91 Add warning url-content-type-unparseable 2022-09-02 19:29:11 +01:00
Kian-Meng Ang
a70ea9ea14 Fix typos
Found via `codespell ./linkcheck/ ./tests ./doc/man/en -L bu,noone,fo,pres,shttp`
2022-09-02 17:20:02 +08:00
Malte Gerth
cc48a09308 Add Telegram and WhatsApp link schemes 2022-02-06 23:41:33 +01:00
Malte Gerth
067dd8edbb Update IANA schemes 2022-02-06 23:40:36 +01:00
Chris Mayo
4444a87eb9 Update Requests bug link 2021-12-15 19:34:24 +00:00
Chris Mayo
76815bcf47 Don't guess the URL for files that end in .html
Fixes:
linkchecker ftp.html
failing looking for ftp://ftp.html
2021-12-13 19:31:13 +00:00
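The behaviour described in 76815bcf47 might look roughly like the sketch below. The function guess_url and its prefix rules are assumptions made for illustration, not LinkChecker's actual code; the only point is that a name ending in .html is left alone before any scheme is guessed from its prefix.

```python
def guess_url(name: str) -> str:
    """Guess a scheme for a bare name (hypothetical sketch)."""
    lowered = name.lower()
    # A name ending in .html is most likely a local file, so do not guess.
    if lowered.endswith(".html"):
        return name
    # Assumed prefix heuristics for bare host names.
    if lowered.startswith("www."):
        return "http://" + name
    if lowered.startswith("ftp."):
        return "ftp://" + name
    return name
```

With this ordering, guess_url("ftp.html") returns "ftp.html" instead of "ftp://ftp.html".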
Chris Mayo
fe5a34c68f Remove linkcheck.checker.proxysupport
Set up the requests.Session() with the complete proxy configuration
to fix a problem with using an HTTP server as an HTTPS proxy and
potential redirection issues.

Requests handles no_proxy.
2021-12-13 19:25:23 +00:00
Chris Mayo
a60648e348 Remove support for ftp_proxy
ftp_proxy was limited to HTTP proxy servers, and keeping it prevents
simplifying and fixing HTTP proxy support.
2021-12-13 19:25:23 +00:00
Chris Mayo
f2e5a435e3 Remove unused ProxySupport.proxyauth
Not used since:
7b34be590 ("Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements.", 2014-03-01)
2021-12-13 19:25:23 +00:00
Chris Mayo
a04214465a Update HttpUrl.encoding after following redirects 2021-12-06 19:34:31 +00:00
Chris Mayo
0325ecd73f Remove httpurl.HEADER_ENCODING
Unused since:
d91a32822 ("Remove strformat.unicode_safe() and strformat.url_unicode_split()", 2020-07-07)
2021-12-06 19:34:31 +00:00
Chris Mayo
c89c617a58 Ignore an encoding of ISO-8859-1 returned by Requests
ISO-8859-1 is a fallback for Requests and causes us to mangle UTF-8
content.

Requests' utils.py:

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    :rtype: str
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = _parse_content_type_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'

    if 'application/json' in content_type:
        # Assume UTF-8 based on RFC 4627: https://www.ietf.org/rfc/rfc4627.txt since the charset was unset
        return 'utf-8'
2021-11-29 19:52:37 +00:00
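The idea behind c89c617a58 can be sketched as follows. Both helper names are hypothetical, and falling back to UTF-8 is an assumption made to keep the example self-contained; the real change may choose the encoding differently.

```python
from typing import Optional


def usable_encoding(reported: Optional[str]) -> Optional[str]:
    """Return the reported encoding unless it is the ISO-8859-1 fallback."""
    # ISO-8859-1 is what Requests reports for text content without a
    # charset, so it says nothing reliable about the actual bytes.
    if reported is None or reported.upper() == "ISO-8859-1":
        return None
    return reported


def decode_body(body: bytes, reported: Optional[str]) -> str:
    """Decode body, assuming UTF-8 when no trustworthy encoding is given."""
    encoding = usable_encoding(reported) or "utf-8"
    return body.decode(encoding, errors="replace")
```

One trade-off of this approach is that a server which explicitly declares charset=ISO-8859-1 is treated the same as one that declares nothing.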
Chris Mayo
43507cf80a Make partial and example URLs in docstrings italic
Prevent Sphinx from turning them into broken links.
2021-08-12 19:28:50 +01:00
Chris Mayo
26c15c5e67 Fix deprecation warning for resolver.query()
/home/travis/build/linkchecker/linkchecker/linkcheck/checker/mailtourl.py:321: DeprecationWarning: please use dns.resolver.resolve() instead
    answers = resolver.query(domain, 'MX')
2020-09-14 19:55:05 +01:00
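A minimal sketch of the change in 26c15c5e67, assuming a resolver object that may or may not offer resolve(). The helper lookup_mx is hypothetical; it takes the resolver as a parameter so the example does not need dnspython installed.

```python
def lookup_mx(resolver, domain: str):
    """Look up MX records, preferring resolve() over the deprecated query()."""
    # Newer dnspython resolvers provide resolve(); fall back to query()
    # only when resolve() is missing.
    lookup = getattr(resolver, "resolve", None) or resolver.query
    return lookup(domain, "MX")
```

If only dnspython versions that provide resolve() need to be supported, calling resolver.resolve(domain, 'MX') directly is simpler.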
Chris Mayo
b1faef93c3
Merge pull request #495 from cjmayo/mswindows
MS Windows Python 3.7 and MS Store compatibility
2020-09-01 19:46:44 +01:00
Chris Mayo
314ec085a3
Merge pull request #462 from cjmayo/anchor
Fix anchor checking
2020-09-01 19:39:29 +01:00
Chris Mayo
2fbd49dd0b Replace os.path.splitunc() with os.path.splitdrive()
os.path.splitunc() removed in Python 3.7.

https://docs.python.org/3/whatsnew/3.7.html#api-and-feature-removals
2020-08-29 16:57:57 +01:00
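The replacement in 2fbd49dd0b can be tried on any platform by calling ntpath directly, which is the module os.path resolves to on Windows. The wrapper name split_unc is hypothetical.

```python
import ntpath
from typing import Tuple


def split_unc(path: str) -> Tuple[str, str]:
    """Split a Windows path into (drive or UNC share, remainder)."""
    # splitdrive() treats the \\host\share prefix of a UNC path as the drive.
    return ntpath.splitdrive(path)


share, rest = split_unc(r"\\server\share\dir\file.txt")
# share is r"\\server\share", rest is r"\dir\file.txt"
```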
Chris Mayo
37e4981089
Merge pull request #492 from cjmayo/pass
Assorted tidying including unneeded pass statements
2020-08-29 16:55:39 +01:00
Chris Mayo
1f58419322 Remove unneeded pass statements 2020-08-22 17:17:02 +01:00
Chris Mayo
8779c39735 Replace deprecated urllib.parse.split functions 2020-08-22 16:28:53 +01:00
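The commit does not list which helpers were replaced. As an illustration only, the sketch below shows how urlsplit() can cover what deprecated helpers such as splittype() and splithost() were typically used for; the function name scheme_host_port is hypothetical.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def scheme_host_port(url: str) -> Tuple[str, Optional[str], Optional[int]]:
    """Return (scheme, hostname, port) using the supported urlsplit() API."""
    parts = urlsplit(url)
    # hostname is lower-cased; port is None when the URL does not give one.
    return parts.scheme, parts.hostname, parts.port
```

For example, scheme_host_port("http://example.org:8080/a?b=1#c") returns ("http", "example.org", 8080).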
Chris Mayo
1b497389b5
Merge pull request #483 from cjmayo/retryafter
Don't translate "Retry-After" server header field
2020-08-21 16:51:17 +01:00
Chris Mayo
5d83e93829
Merge pull request #475 from cjmayo/iana
Update IANA scripts and ignored schemes
2020-08-18 19:40:35 +01:00
Chris Mayo
0269fd88b0 Merge pull request #473 from cjmayo/valueerror
Fix critical exception when parsing a URL with a ]
2020-08-15 16:51:17 +01:00
Chris Mayo
7ee151ebbf Don't translate "Retry-After" server header field
It is defined in RFC 7231.
2020-08-14 19:29:19 +01:00
Chris Mayo
80763ed1ea Add slack to the list of ignored schemes
slack:// is a way to interact with a local Slack client [1], and is not
something that LinkChecker can check.

[1] https://api.slack.com/reference/deep-linking#client
2020-08-09 17:10:26 +01:00
Chris Mayo
f19fd4f5bc Update IANA scripts and ignored schemes (2020-07-28) 2020-08-09 17:10:26 +01:00
Chris Mayo
d5690203fc Fix critical exception when parsing a URL with a ]
e.g.:
<a href="http://localhost]">square</a>

Causes urllib to raise a ValueError:
  File "/usr/lib/python3.8/site-packages/linkcheck/url.py", line 315, in url_norm
    line: urlparts = list(urllib.parse.urlsplit(url))
    locals:
      urlparts = <not found>
      list = <builtin> <class 'list'>
      urllib = <global> <module 'urllib' from '/usr/lib/python3.8/urllib/__init__.py'>
      urllib.parse = <global> <module 'urllib.parse' from '/usr/lib/python3.8/urllib/parse.py'>
      urllib.parse.urlsplit = <global> <function urlsplit at 0x7f950e699e50>
      url = <local> 'http://localhost]', len = 17
  File "/usr/lib/python3.8/urllib/parse.py", line 440, in urlsplit
    line: raise ValueError("Invalid IPv6 URL")
    locals:
      ValueError = <builtin> <class 'ValueError'>
2020-08-08 16:47:31 +01:00
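One way to avoid the crash shown in the traceback of d5690203fc is to catch the ValueError at the call site. The wrapper safe_urlsplit is a hypothetical sketch, not the fix that was actually committed.

```python
from typing import Optional
from urllib.parse import SplitResult, urlsplit


def safe_urlsplit(url: str) -> Optional[SplitResult]:
    """Split url, returning None instead of raising on malformed input."""
    try:
        return urlsplit(url)
    except ValueError:
        # Raised for unbalanced brackets in the host, e.g. "http://localhost]".
        return None
```

A caller can then report the URL as syntactically invalid instead of aborting the whole check.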
Chris Mayo
0912e8a2c1 Don't strip the URL fragment from cache key if using AnchorCheck
Otherwise, once one URL for a page has been checked, URLs with different
fragments are skipped and not passed to AnchorCheck.

eaa538c ("don't check one url multiple times", 2016-11-09)
2020-07-27 19:25:30 +01:00
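The cache key rule from 0912e8a2c1 could be sketched like this. The function cache_key and its anchor_check_enabled parameter are hypothetical names chosen for the example.

```python
from urllib.parse import urldefrag


def cache_key(url: str, anchor_check_enabled: bool) -> str:
    """Build the key used to decide whether a URL was already checked."""
    if anchor_check_enabled:
        # Keep the fragment so each anchor on a page is checked separately.
        return url
    # Without anchor checking, URLs that differ only by fragment are the same page.
    return urldefrag(url).url
```

With anchor checking enabled, http://example.org/page#a and http://example.org/page#b get distinct keys; otherwise both map to http://example.org/page.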
Chris Mayo
dee21ee9a0 Fix formatting and typos in docstrings 2020-07-25 16:35:48 +01:00
Chris Mayo
a977e4d712
Merge pull request #444 from cjmayo/isinstance
Remove or replace uses of isinstance()
2020-07-08 19:55:29 +01:00
Chris Mayo
b328520f08 Convert UrlBase syntax Exception to a string
Causes an exception when logging.
2020-07-07 17:25:28 +01:00
Chris Mayo
53bd5c4d21 Remove HttpUrl.getheader() 2020-07-07 17:25:28 +01:00
Chris Mayo
3fcee872b6 urlparts need to support assignment 2020-07-07 17:25:28 +01:00
Chris Mayo
d91a328224 Remove strformat.unicode_safe() and strformat.url_unicode_split()
All strings support Unicode in Python 3.
2020-07-07 17:25:28 +01:00
Chris Mayo
f86e506de4 Remove isinstance() from FileUrl.read_content()
get_index_html() returns a string.
2020-06-18 19:27:06 +01:00
Chris Mayo
36246c15ac Update various comments to https 2020-06-05 16:59:46 +01:00
Chris Mayo
a6b1eb45b1 Convert to Python 3 super() 2020-06-03 20:06:36 +01:00
Chris Mayo
cec9b78f5e Additional review comments on black linkcheck/ 2020-06-03 20:06:36 +01:00
Chris Mayo
b974ec3262 Review comments on black linkcheck/ 2020-06-01 16:07:21 +01:00
Chris Mayo
ac0967e251 Fix remaining flake8 violations in linkcheck/
linkcheck/better_exchook2.py:28:89: E501 line too long (90 > 88 characters)
linkcheck/better_exchook2.py:155:9: E722 do not use bare 'except'
linkcheck/better_exchook2.py:166:9: E722 do not use bare 'except'
linkcheck/better_exchook2.py:289:13: E741 ambiguous variable name 'l'
linkcheck/better_exchook2.py:299:9: E722 do not use bare 'except'
linkcheck/containers.py:48:13: E731 do not assign a lambda expression, use a def
linkcheck/ftpparse.py:123:89: E501 line too long (93 > 88 characters)
linkcheck/loader.py:46:47: E203 whitespace before ':'
linkcheck/logconf.py:45:29: E231 missing whitespace after ','
linkcheck/robotparser2.py:157:89: E501 line too long (95 > 88 characters)
linkcheck/robotparser2.py:182:89: E501 line too long (89 > 88 characters)
linkcheck/strformat.py:181:16: E203 whitespace before ':'
linkcheck/strformat.py:181:43: E203 whitespace before ':'
linkcheck/strformat.py:253:9: E731 do not assign a lambda expression, use a def
linkcheck/strformat.py:254:9: E731 do not assign a lambda expression, use a def
linkcheck/strformat.py:341:89: E501 line too long (111 > 88 characters)
linkcheck/url.py:102:32: E203 whitespace before ':'
linkcheck/url.py:277:5: E741 ambiguous variable name 'l'
linkcheck/url.py:402:5: E741 ambiguous variable name 'l'
linkcheck/checker/__init__.py:203:1: E402 module level import not at top of file
linkcheck/checker/fileurl.py:200:89: E501 line too long (103 > 88 characters)
linkcheck/checker/mailtourl.py:122:60: E203 whitespace before ':'
linkcheck/checker/mailtourl.py:157:89: E501 line too long (96 > 88 characters)
linkcheck/checker/mailtourl.py:190:89: E501 line too long (109 > 88 characters)
linkcheck/checker/mailtourl.py:200:89: E501 line too long (111 > 88 characters)
linkcheck/checker/mailtourl.py:249:89: E501 line too long (106 > 88 characters)
linkcheck/checker/unknownurl.py:226:23: W291 trailing whitespace
linkcheck/checker/urlbase.py:245:89: E501 line too long (101 > 88 characters)
linkcheck/configuration/confparse.py:236:89: E501 line too long (186 > 88 characters)
linkcheck/configuration/confparse.py:247:89: E501 line too long (111 > 88 characters)
linkcheck/configuration/__init__.py:164:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:184:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:190:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:195:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:198:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:435:89: E501 line too long (90 > 88 characters)
linkcheck/director/aggregator.py:45:43: E231 missing whitespace after ','
linkcheck/director/aggregator.py:178:89: E501 line too long (106 > 88 characters)
linkcheck/logger/__init__.py:29:1: E731 do not assign a lambda expression, use a def
linkcheck/logger/__init__.py:108:13: E741 ambiguous variable name 'l'
linkcheck/logger/__init__.py:275:19: F821 undefined name '_'
linkcheck/logger/__init__.py:342:16: F821 undefined name '_'
linkcheck/logger/__init__.py:380:13: F821 undefined name '_'
linkcheck/logger/__init__.py:384:13: F821 undefined name '_'
linkcheck/logger/__init__.py:387:13: F821 undefined name '_'
linkcheck/logger/__init__.py:396:13: F821 undefined name '_'
linkcheck/network/__init__.py:1:1: W391 blank line at end of file
linkcheck/plugins/locationinfo.py:89:9: E731 do not assign a lambda expression, use a def
linkcheck/plugins/locationinfo.py:91:9: E731 do not assign a lambda expression, use a def
linkcheck/plugins/markdowncheck.py:112:89: E501 line too long (111 > 88 characters)
linkcheck/plugins/markdowncheck.py:141:9: E741 ambiguous variable name 'l'
linkcheck/plugins/markdowncheck.py:165:23: E203 whitespace before ':'
linkcheck/plugins/viruscheck.py:95:42: E203 whitespace before ':'
2020-05-30 17:01:36 +01:00
Chris Mayo
8dc2f12b94 Address space-separated strings in linkcheck/ 2020-05-30 17:01:36 +01:00