Commit graph

329 commits

Author SHA1 Message Date
Chris Mayo
e6da68b7f6 Add linting with Pylint to build workflow 2023-05-03 19:24:53 +01:00
Chris Mayo
8065c75c4e Convert some printf-style strings 2022-11-08 19:21:29 +00:00
Chris Mayo
b6bc366af0 Run pyupgrade --py37-plus x 2 2022-11-08 19:21:29 +00:00
Chris Mayo
0bb1576887 Run pyupgrade --py37-plus --keep-percent-format 2022-11-08 19:21:29 +00:00
Chris Mayo
eab2fa410e Log robots.txt as the sitemap parent URL
This is the location the sitemap URL was found in. The line being
reported is the line in robots.txt.
2022-10-17 19:21:03 +01:00
Nathan Arthur
a29750c57f Fix anchor comments in UrlBase
Parent url query not stripped since:
4a0c63aa ("Fix joining of URLs when parent URL has CGI parameter.", 2011-02-08)
2022-10-03 19:33:05 +01:00
Chris Mayo
52b9881820 Separate URL encoding and content encoding
Ensure users of url_data.encoding are using the URL encoding.

Combined since:
5fc01455 ("Decode content when retrieved, use bs4 to detect encoding if non-Unicode", 2019-09-30)
2022-09-29 19:21:11 +01:00
Lukas Pirl
8c959589c3
add option to ignore specific errors for specific URLs 2022-09-25 22:52:04 +02:00
Chris Mayo
c79bc07cee Add MIME type application/vnd.adobe.flash.movie 2022-09-02 19:29:11 +01:00
Chris Mayo
d6936ceb91 Add warning url-content-type-unparseable 2022-09-02 19:29:11 +01:00
Kian-Meng Ang
a70ea9ea14 Fix typos
Found via `codespell ./linkcheck/ ./tests ./doc/man/en -L bu,noone,fo,pres,shttp`
2022-09-02 17:20:02 +08:00
Chris Mayo
43507cf80a Make partial and example URLs in docstrings italic
Prevent Sphinx from turning them into broken links.
2021-08-12 19:28:50 +01:00
Chris Mayo
314ec085a3
Merge pull request #462 from cjmayo/anchor
Fix anchor checking
2020-09-01 19:39:29 +01:00
Chris Mayo
8779c39735 Replace deprecated urllib.parse.split functions 2020-08-22 16:28:53 +01:00
Chris Mayo
0912e8a2c1 Don't strip the URL fragment from cache key if using AnchorCheck
Else once one URL for a page has been checked, URLs with different
fragments are skipped and not passed to AnchorCheck.

eaa538c ("don't check one url multiple times", 2016-11-09)
2020-07-27 19:25:30 +01:00
Chris Mayo
dee21ee9a0 Fix formatting and typos in docstrings 2020-07-25 16:35:48 +01:00
Chris Mayo
b328520f08 Convert UrlBase syntax Exception to a string
Causes an exception when logging.
2020-07-07 17:25:28 +01:00
Chris Mayo
3fcee872b6 urlparts need to support assignment 2020-07-07 17:25:28 +01:00
Chris Mayo
d91a328224 Remove strformat.unicode_safe() and strformat.url_unicode_split()
All strings support Unicode in Python 3.
2020-07-07 17:25:28 +01:00
Chris Mayo
b974ec3262 Review comments on black linkcheck/ 2020-06-01 16:07:21 +01:00
Chris Mayo
ac0967e251 Fix remaining flake8 violations in linkcheck/
linkcheck/better_exchook2.py:28:89: E501 line too long (90 > 88 characters)
linkcheck/better_exchook2.py:155:9: E722 do not use bare 'except'
linkcheck/better_exchook2.py:166:9: E722 do not use bare 'except'
linkcheck/better_exchook2.py:289:13: E741 ambiguous variable name 'l'
linkcheck/better_exchook2.py:299:9: E722 do not use bare 'except'
linkcheck/containers.py:48:13: E731 do not assign a lambda expression, use a def
linkcheck/ftpparse.py:123:89: E501 line too long (93 > 88 characters)
linkcheck/loader.py:46:47: E203 whitespace before ':'
linkcheck/logconf.py:45:29: E231 missing whitespace after ','
linkcheck/robotparser2.py:157:89: E501 line too long (95 > 88 characters)
linkcheck/robotparser2.py:182:89: E501 line too long (89 > 88 characters)
linkcheck/strformat.py:181:16: E203 whitespace before ':'
linkcheck/strformat.py:181:43: E203 whitespace before ':'
linkcheck/strformat.py:253:9: E731 do not assign a lambda expression, use a def
linkcheck/strformat.py:254:9: E731 do not assign a lambda expression, use a def
linkcheck/strformat.py:341:89: E501 line too long (111 > 88 characters)
linkcheck/url.py:102:32: E203 whitespace before ':'
linkcheck/url.py:277:5: E741 ambiguous variable name 'l'
linkcheck/url.py:402:5: E741 ambiguous variable name 'l'
linkcheck/checker/__init__.py:203:1: E402 module level import not at top of file
linkcheck/checker/fileurl.py:200:89: E501 line too long (103 > 88 characters)
linkcheck/checker/mailtourl.py:122:60: E203 whitespace before ':'
linkcheck/checker/mailtourl.py:157:89: E501 line too long (96 > 88 characters)
linkcheck/checker/mailtourl.py:190:89: E501 line too long (109 > 88 characters)
linkcheck/checker/mailtourl.py:200:89: E501 line too long (111 > 88 characters)
linkcheck/checker/mailtourl.py:249:89: E501 line too long (106 > 88 characters)
linkcheck/checker/unknownurl.py:226:23: W291 trailing whitespace
linkcheck/checker/urlbase.py:245:89: E501 line too long (101 > 88 characters)
linkcheck/configuration/confparse.py:236:89: E501 line too long (186 > 88 characters)
linkcheck/configuration/confparse.py:247:89: E501 line too long (111 > 88 characters)
linkcheck/configuration/__init__.py:164:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:184:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:190:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:195:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:198:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:435:89: E501 line too long (90 > 88 characters)
linkcheck/director/aggregator.py:45:43: E231 missing whitespace after ','
linkcheck/director/aggregator.py:178:89: E501 line too long (106 > 88 characters)
linkcheck/logger/__init__.py:29:1: E731 do not assign a lambda expression, use a def
linkcheck/logger/__init__.py:108:13: E741 ambiguous variable name 'l'
linkcheck/logger/__init__.py:275:19: F821 undefined name '_'
linkcheck/logger/__init__.py:342:16: F821 undefined name '_'
linkcheck/logger/__init__.py:380:13: F821 undefined name '_'
linkcheck/logger/__init__.py:384:13: F821 undefined name '_'
linkcheck/logger/__init__.py:387:13: F821 undefined name '_'
linkcheck/logger/__init__.py:396:13: F821 undefined name '_'
linkcheck/network/__init__.py:1:1: W391 blank line at end of file
linkcheck/plugins/locationinfo.py:89:9: E731 do not assign a lambda expression, use a def
linkcheck/plugins/locationinfo.py:91:9: E731 do not assign a lambda expression, use a def
linkcheck/plugins/markdowncheck.py:112:89: E501 line too long (111 > 88 characters)
linkcheck/plugins/markdowncheck.py:141:9: E741 ambiguous variable name 'l'
linkcheck/plugins/markdowncheck.py:165:23: E203 whitespace before ':'
linkcheck/plugins/viruscheck.py:95:42: E203 whitespace before ':'
2020-05-30 17:01:36 +01:00
Chris Mayo
8dc2f12b94 Address space-separated strings in linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
a92a684ac4 Run black on linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
03b1c4919d Record encoding in debug log messages 2020-05-23 20:01:24 +01:00
Chris Mayo
f7337f55e8 Fix error due to an empty html file accessed over http
Use the already fixed [1] UrlBase.get_content() in HttpUrl.

[1] 5bd1fb4 ("Fix internal error on empty HTML files", 2020-05-21)
2020-05-23 20:01:24 +01:00
Marius Gedminas
c60d7c66e4 Clarify the decision to fall back to Latin-1 2020-05-21 19:35:39 +03:00
Marius Gedminas
5bd1fb4e36 Fix internal error on empty HTML files
When BeautifulSoup finds an empty file on disk, it sets
original_encoding to None.  It doesn't matter what encoding we pick for
empty files, so let's just pick one.

I don't know if there are any circumstances where BeautifulSoup might
set the encoding to None for a non-empty file.

Closes #392.
2020-05-21 19:01:33 +03:00
Chris Mayo
6bddd4ac60 Remove str_text from checker/ 2020-05-19 19:56:42 +01:00
Chris Mayo
a127902607 Replace str_text in asserts 2020-05-19 19:56:42 +01:00
Chris Mayo
a15a2833ca Remove spaces after names in class method definitions
And also nested functions.

This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
1663e10fe7 Remove spaces after names in function definitions
This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
42de609f8e Make urllib imports Python 3 only 2020-05-14 20:15:28 +01:00
Chris Mayo
736c893707
Merge pull request #377 from cjmayo/tidyten3
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
44e81d27dd Remove inheriting object
All Python 3 classes are new-style.
2020-05-08 10:45:31 +01:00
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-unicode characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa Remove u string prefixes 2020-04-30 20:11:59 +01:00
anarcat
183d483074
Merge pull request #365 from cjmayo/tidyten1
Remove use of the future package
2020-04-26 12:02:30 -04:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
f5e7f3a382 Remove use of the future package
It was providing Python 2 compatibility.
2020-04-15 19:49:16 +01:00
Chris Mayo
40f43ae41c Create one function to make soup objects 2020-04-08 20:03:35 +01:00
Chris Mayo
3ff3d72492 Use BeautifulSoup element attrs directly 2020-04-03 19:24:08 +01:00
Chris Mayo
5b66964afa Remove unused .charset from checker classes
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Chris Mayo
646e138166 Pass encoding when unquoting
Else non-UTF-8 codes are misinterpreted:

>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
2019-10-05 19:38:57 +01:00
Chris Mayo
153e53ba03 Reuse soup object used for detecting encoding in the HTML parser 2019-10-05 19:38:57 +01:00
Chris Mayo
607328d5c5 Support Beautiful Soup line numbers 2019-10-05 19:38:57 +01:00
Chris Mayo
5fc01455b7 Decode content when retrieved, use bs4 to detect encoding if non-Unicode
UrlBase has been modified as follows:
- the "data" variable now holds bytes
- decoded content is stored in a new variable "text"
- functionality from get_content() has been split out into
  get_raw_content() which returns "data" and download_content() which
  calls read_content() and sets the download related variables.
  This allows for subclasses to do their own decoding and parsers to
  use bytes.
2019-09-30 19:46:24 +01:00
Petr Dlouhý
c2af88ad2e Python3: fix for test_telnet in urlbase.py 2019-09-15 19:49:26 +01:00
Petr Dlouhý
e10f25b968 fixes for Python 3: fix running problems in Python 3 2019-09-10 19:30:09 +01:00
Petr Dlouhý
e92b0a9f7b Python3: fix unicode in urlbase 2019-04-25 19:57:45 +01:00
Petr Dlouhý
b3881ce3b5 Python3: fix urlbase, strformat and others 2019-04-25 19:57:45 +01:00