Commit graph

3273 commits

Author SHA1 Message Date
Chris Mayo
18f20d592f Check for KDE 5 proxy first and then KDE 4
Don't look for kde4-config in case a KDE 5 user still has it installed.
2020-07-07 17:06:25 +01:00
Chris Mayo
bd55c2ef8f Compare KDE proxy ReversedException integer value to zero 2020-07-07 17:06:25 +01:00
Chris Mayo
da22d4886b
Merge pull request #441 from cjmayo/authentication
Improve documentation of authentication
2020-06-23 17:35:19 +01:00
Chris Mayo
085ae188f7 Remove checks for empty loginpasswordfield and loginuserfield
These have default values and cannot be reset.
2020-06-23 17:28:31 +01:00
Chris Mayo
1ec3848720 Log problem with login form without exception 2020-06-23 17:28:31 +01:00
Chris Mayo
2f51a9dca0 Improve documentation of authentication 2020-06-23 17:28:31 +01:00
Chris Mayo
d66e64460c Remove unused code from strformat.py 2020-06-18 19:31:00 +01:00
Chris Mayo
1f77506c9f Remove isinstance() in url.url_fix_mailto_urlsplit()
urls are strings.
2020-06-18 19:27:06 +01:00
Chris Mayo
8f9f687ed8 Remove isinstance() from fileutil.path_safe()
paths are derived from urls which are strings.
2020-06-18 19:27:06 +01:00
Chris Mayo
f86e506de4 Remove isinstance() from FileUrl.read_content()
get_index_html() returns a string.
2020-06-18 19:27:06 +01:00
Chris Mayo
3231730366 Remove isinstance() from robotparser2.py
Originally for encoding Python 2 Unicode strings [1]. Not needed in
Python 3 because the variables are strings; if they were bytes,
exceptions would be raised.

[1] c97f68f7 ("accept unicode in robots.txt can_fetch", 2004-11-09)
2020-06-18 19:27:06 +01:00
Chris Mayo
9c9a3d8b14 Remove isinstance() from url.idna_encode()
Was originally used for Python 2 Unicode strings.
f4b73c6d ("Python3: fix unicode in url.py", 2018-01-05)
2020-06-18 19:27:06 +01:00
Chris Mayo
3a6540bc46 Replace isinstance() in strformat.ascii_safe() 2020-06-18 19:27:06 +01:00
Chris Mayo
4009039158
Merge pull request #420 from cjmayo/dconf
Update GNOME proxy support for GNOME 3 and Python 3
2020-06-14 18:56:19 +01:00
Chris Mayo
b6004fb6b1 Simplify and add debug messages to KDE proxy retrieval 2020-06-08 17:00:10 +01:00
Chris Mayo
29b292c90f Replace KDE 3 proxy support with KDE 5 support
KDE 3 was superseded in 2008.

KDE 4 uses: ${HOME}/.kde4/share/config/kioslaverc
KDE 5 (Kubuntu) uses: ${HOME}/.config/kioslaverc

Default ReversedException is false
2020-06-08 17:00:10 +01:00
Chris Mayo
9108afeee5 Add html.escape on URLs in logger/html.py 2020-06-05 16:59:46 +01:00
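Note on 9108afeee5: a minimal sketch of why URLs are escaped before being written into HTML output. The helper name `link_markup` is hypothetical and is not taken from logger/html.py; only `html.escape()` from the standard library is assumed.

```python
import html


def link_markup(url):
    """Return an anchor element with the URL escaped for HTML output.

    Hypothetical helper for illustration only.
    """
    # html.escape() replaces &, < and >, and (with the default quote=True)
    # also the quote characters, so the URL is safe inside an attribute.
    safe = html.escape(url)
    return '<a href="%s">%s</a>' % (safe, safe)
```

Without escaping, a URL containing `<` or `"` could break out of the attribute or inject markup into the report.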
Chris Mayo
eeb5fa48ca Update configuration/confparse.py log message to https 2020-06-05 16:59:46 +01:00
Chris Mayo
0191b021f4 Make configuration/confparse.py log message translatable 2020-06-05 16:59:46 +01:00
Chris Mayo
36246c15ac Update various comments to https 2020-06-05 16:59:46 +01:00
Chris Mayo
3bd790c22d Update W3C validator links to use https 2020-06-05 16:59:46 +01:00
Chris Mayo
b987d6f3ca Fix indent in plugins/locationinfo.py 2020-06-05 16:59:46 +01:00
Chris Mayo
4330b8a59e Replace codecs.open() with open() 2020-06-05 16:59:46 +01:00
Chris Mayo
b9c8e33878 Update GNOME proxy support for GNOME 3 and Python 3
GConf is replaced by dconf and the GSettings API in GNOME 3.
2020-06-05 16:29:45 +01:00
Chris Mayo
e207ac54ce
Merge pull request #437 from cjmayo/translate
Update man page translation and fixes for application translation process
2020-06-05 16:17:06 +01:00
Chris Mayo
1632a1ce26 Fix xgettext Non-ASCII error when translating
xgettext: Non-ASCII character at
../linkcheck/plugins/markdowncheck.py:2.
          Please specify the source encoding through --from-code or through a comment
          as specified in https://www.python.org/peps/pep-0263.html.

make: *** [Makefile:25: linkchecker.pot] Error 1
2020-06-05 16:06:01 +01:00
Chris Mayo
d591fedb60 Remove unused updater code that supports linkchecker-gui
pip provides update support for linkchecker.
2020-06-05 16:05:25 +01:00
Chris Mayo
a6b1eb45b1 Convert to Python 3 super() 2020-06-03 20:06:36 +01:00
Chris Mayo
cec9b78f5e Additional review comments on black linkcheck/ 2020-06-03 20:06:36 +01:00
Chris Mayo
6b3cb18546 Restore better_exchook2.py and colorama.py to pre-Black state
These files are based on published packages.

better_exchook2.py was derived from better_exchook.py in:
https://pypi.org/project/better_exchook/

colorama.py was derived from win32.py in:
https://pypi.org/project/colorama/

Files modified in:
a92a684a ("Run black on linkcheck/", 2020-05-30)
2020-06-03 20:06:36 +01:00
Chris Mayo
b974ec3262 Review comments on black linkcheck/ 2020-06-01 16:07:21 +01:00
Chris Mayo
ac0967e251 Fix remaining flake8 violations in linkcheck/
linkcheck/better_exchook2.py:28:89: E501 line too long (90 > 88 characters)
linkcheck/better_exchook2.py:155:9: E722 do not use bare 'except'
linkcheck/better_exchook2.py:166:9: E722 do not use bare 'except'
linkcheck/better_exchook2.py:289:13: E741 ambiguous variable name 'l'
linkcheck/better_exchook2.py:299:9: E722 do not use bare 'except'
linkcheck/containers.py:48:13: E731 do not assign a lambda expression, use a def
linkcheck/ftpparse.py:123:89: E501 line too long (93 > 88 characters)
linkcheck/loader.py:46:47: E203 whitespace before ':'
linkcheck/logconf.py:45:29: E231 missing whitespace after ','
linkcheck/robotparser2.py:157:89: E501 line too long (95 > 88 characters)
linkcheck/robotparser2.py:182:89: E501 line too long (89 > 88 characters)
linkcheck/strformat.py:181:16: E203 whitespace before ':'
linkcheck/strformat.py:181:43: E203 whitespace before ':'
linkcheck/strformat.py:253:9: E731 do not assign a lambda expression, use a def
linkcheck/strformat.py:254:9: E731 do not assign a lambda expression, use a def
linkcheck/strformat.py:341:89: E501 line too long (111 > 88 characters)
linkcheck/url.py:102:32: E203 whitespace before ':'
linkcheck/url.py:277:5: E741 ambiguous variable name 'l'
linkcheck/url.py:402:5: E741 ambiguous variable name 'l'
linkcheck/checker/__init__.py:203:1: E402 module level import not at top of file
linkcheck/checker/fileurl.py:200:89: E501 line too long (103 > 88 characters)
linkcheck/checker/mailtourl.py:122:60: E203 whitespace before ':'
linkcheck/checker/mailtourl.py:157:89: E501 line too long (96 > 88 characters)
linkcheck/checker/mailtourl.py:190:89: E501 line too long (109 > 88 characters)
linkcheck/checker/mailtourl.py:200:89: E501 line too long (111 > 88 characters)
linkcheck/checker/mailtourl.py:249:89: E501 line too long (106 > 88 characters)
linkcheck/checker/unknownurl.py:226:23: W291 trailing whitespace
linkcheck/checker/urlbase.py:245:89: E501 line too long (101 > 88 characters)
linkcheck/configuration/confparse.py:236:89: E501 line too long (186 > 88 characters)
linkcheck/configuration/confparse.py:247:89: E501 line too long (111 > 88 characters)
linkcheck/configuration/__init__.py:164:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:184:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:190:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:195:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:198:9: E266 too many leading '#' for block comment
linkcheck/configuration/__init__.py:435:89: E501 line too long (90 > 88 characters)
linkcheck/director/aggregator.py:45:43: E231 missing whitespace after ','
linkcheck/director/aggregator.py:178:89: E501 line too long (106 > 88 characters)
linkcheck/logger/__init__.py:29:1: E731 do not assign a lambda expression, use a def
linkcheck/logger/__init__.py:108:13: E741 ambiguous variable name 'l'
linkcheck/logger/__init__.py:275:19: F821 undefined name '_'
linkcheck/logger/__init__.py:342:16: F821 undefined name '_'
linkcheck/logger/__init__.py:380:13: F821 undefined name '_'
linkcheck/logger/__init__.py:384:13: F821 undefined name '_'
linkcheck/logger/__init__.py:387:13: F821 undefined name '_'
linkcheck/logger/__init__.py:396:13: F821 undefined name '_'
linkcheck/network/__init__.py:1:1: W391 blank line at end of file
linkcheck/plugins/locationinfo.py:89:9: E731 do not assign a lambda expression, use a def
linkcheck/plugins/locationinfo.py:91:9: E731 do not assign a lambda expression, use a def
linkcheck/plugins/markdowncheck.py:112:89: E501 line too long (111 > 88 characters)
linkcheck/plugins/markdowncheck.py:141:9: E741 ambiguous variable name 'l'
linkcheck/plugins/markdowncheck.py:165:23: E203 whitespace before ':'
linkcheck/plugins/viruscheck.py:95:42: E203 whitespace before ':'
2020-05-30 17:01:36 +01:00
Chris Mayo
8dc2f12b94 Address space-separated strings in linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
b9f4864d9e Remove unnecessary commas before closing brackets in linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
a92a684ac4 Run black on linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
abdb160413 Remove unused bookmarks code that supports linkcheck-gui
linkchecker does not need to find a bookmark file because it is given the URL.
Most bookmarks are detected by their MIME type; Firefox is different
because it uses a SQLite database.
2020-05-28 19:44:53 +01:00
Chris Mayo
e204182acb Remove unused httputil.has_header_value() 2020-05-28 19:44:53 +01:00
Chris Mayo
4d2449bb13
Merge pull request #425 from cjmayo/xdg_config_home
Fix xdg_config_home import in bookmarks/chrome.py
2020-05-28 19:18:21 +01:00
Chris Mayo
75349e4dc9 Fix xdg_config_home import in bookmarks/chrome.py 2020-05-27 20:02:07 +01:00
Chris Mayo
a49f42b617 Remove unused mem.py 2020-05-27 20:01:57 +01:00
Chris Mayo
488e72c81f Ignore imports providing aliases in subpackages 2020-05-26 19:49:59 +01:00
Chris Mayo
97f50e8be1 Remove unused import htmlsoup from checker/httpurl.py
Unused since:

f7337f55 ("Fix error due to an empty html file accessed over http", 2020-05-23)
2020-05-25 19:50:57 +01:00
Chris Mayo
3473656fe1 Replace import of distutils.spawn.find_executable with shutil.which 2020-05-25 19:50:57 +01:00
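Note on 3473656fe1: a small sketch of the standard-library replacement named in the title. The wrapper `find_program` is a hypothetical name used here for illustration; the project's call sites are not shown in this log.

```python
import shutil


def find_program(name):
    """Return the full path of an executable on PATH, or None if absent.

    Hypothetical wrapper; shutil.which() has been available since
    Python 3.3 and does not depend on distutils.
    """
    return shutil.which(name)
```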
Chris Mayo
6dda2f9669 Move imports to the top of files to resolve flake8 E402 2020-05-25 19:50:57 +01:00
Chris Mayo
0f3444e906 Drop run-time requests version check
Requests 2.4.0 was released in 2014.
2020-05-25 19:50:57 +01:00
Chris Mayo
89c7c74bcf Remove unused set_linecache() from better_exchook2.py 2020-05-25 19:50:57 +01:00
Chris Mayo
7257e5e1a0 Remove unused imports in parser/__init__.py 2020-05-25 19:50:57 +01:00
Chris Mayo
313a14ff0d Remove instances of Python 2 unicode 2020-05-24 19:14:47 +01:00
Marius Gedminas
d0169c46d4
Merge pull request #348 from weshaggard/HandleRateLimiting
Turn status code 429 into warning instead of failure
2020-05-24 16:16:56 +03:00
Marius Gedminas
dcafa2df75
Avoid u-prefixed strings
linkchecker is Python 3 only, all strings are unicode.
2020-05-24 14:50:07 +03:00
Chris Mayo
03b1c4919d Record encoding in debug log messages 2020-05-23 20:01:24 +01:00
Chris Mayo
f7337f55e8 Fix error due to an empty html file accessed over http
Use the already fixed [1] UrlBase.get_content() in HttpUrl.

[1] 5bd1fb4 ("Fix internal error on empty HTML files", 2020-05-21)
2020-05-23 20:01:24 +01:00
Marius Gedminas
f268a90cfb
Merge branch 'master' into HandleRateLimiting 2020-05-23 14:15:52 +03:00
Marius Gedminas
6dffacf17f
Merge pull request #409 from linkchecker/fix-login-timeouts
Make sure login form fetching uses a timeout and sends User-Agent
2020-05-22 21:40:48 +03:00
Marius Gedminas
b0435b3d47 Make sure login form fetching uses a timeout
Also resolve an XXX comment about the User-Agent header (which is
configured in new_request_session), but add a couple of XXX comments
about using proxy and possibly disabling TLS certificate checking.
2020-05-22 11:19:51 +03:00
Marius Gedminas
4f3fe5e1c3 Make sure fetching robots.txt uses the configured timeout
Closes #396.
2020-05-22 10:53:33 +03:00
Marius Gedminas
c60d7c66e4 Clarify the decision to fall back to Latin-1 2020-05-21 19:35:39 +03:00
Marius Gedminas
5bd1fb4e36 Fix internal error on empty HTML files
When BeautifulSoup finds an empty file on disk, it sets
original_encoding to None.  It doesn't matter what encoding we pick for
empty files, so let's just pick one.

I don't know if there are any circumstances where BeautifulSoup might
set the encoding to None for a non-empty file.

Closes #392.
2020-05-21 19:01:33 +03:00
Chris Mayo
6cfc8eeb49 Replace threading.Thread.setName() with setting the name property
As recommended in:

https://docs.python.org/3.5/library/threading.html#threading.Thread.setName
2020-05-20 19:58:44 +01:00
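Note on 6cfc8eeb49: the change amounts to assigning the `name` property rather than calling `setName()`. A minimal sketch, where `make_named_thread` is a hypothetical helper and not a linkchecker function:

```python
import threading


def make_named_thread(name):
    """Create a thread (not started) and name it through the property."""
    thread = threading.Thread(target=lambda: None)
    # Preferred spelling: assign the property instead of thread.setName(name).
    thread.name = name
    return thread
```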
Chris Mayo
42eba19a7d No need to encode url in Checker.check_url_data()
Was causing b'' in log messages e.g. CheckThread-b'http:...
2020-05-20 19:58:44 +01:00
Chris Mayo
28f4587dfa Remove str_text from fileutil.py, strformat.py and url.py 2020-05-19 19:56:42 +01:00
Chris Mayo
ebcc3c4961 Remove str_text from plugins/ 2020-05-19 19:56:42 +01:00
Chris Mayo
1c14583535 Remove str_text from logger/ 2020-05-19 19:56:42 +01:00
Chris Mayo
6bddd4ac60 Remove str_text from checker/ 2020-05-19 19:56:42 +01:00
Chris Mayo
a127902607 Replace str_text in asserts 2020-05-19 19:56:42 +01:00
Chris Mayo
7490804e2c
Merge pull request #395 from cjmayo/tidyten11
Remove unused code from linkcheck/fileutil.py
2020-05-19 19:45:08 +01:00
Marius Gedminas
e6e969f975
Merge pull request #391 from linkchecker/dev-version
Bump version in git to 10.0.0.dev0
2020-05-19 18:49:34 +03:00
Chris Mayo
690605c519 Remove unused code from linkcheck/fileutil.py 2020-05-18 19:29:55 +01:00
Marius Gedminas
5317347e54 Avoid distutils.version.StrictVersion
distutils.version is old code that predates PEP 440.  We could add a
dependency on https://packaging.pypa.io/en/latest/version/, but meh.
2020-05-17 21:12:43 +03:00
Marius Gedminas
bb53aaa621 Fix viruscheck plugin
The clamav interface needs bytes, not unicode.

It would be nice if we had tests for this code.
2020-05-17 17:50:11 +01:00
Chris Mayo
a15a2833ca Remove spaces after names in class method definitions
And also nested functions.

This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
1663e10fe7 Remove spaces after names in function definitions
This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
fc11d08968 Remove spaces after names in class definitions 2020-05-16 20:19:42 +01:00
Chris Mayo
1416a08119 On Python 3 no need to convert os.linesep to a string 2020-05-16 17:02:01 +01:00
Chris Mayo
0752408a44 Remove Python 2 use of sys.stdout in i18n.get_encoded_writer() 2020-05-16 17:02:00 +01:00
Chris Mayo
2c2e7e55ac Remove CSVLogger.encode_row_s()
Introduced during Python 3 conversion to maintain Python 2 support:

55a7973b ("Python3: fix csvlog", 2016-12-04)
2020-05-16 17:02:00 +01:00
Chris Mayo
ed13a926d3 Remove setting Python 2 xmlparser.returns_unicode 2020-05-16 17:02:00 +01:00
Chris Mayo
025637b08d Remove Python 2 cookielib import 2020-05-16 16:26:38 +01:00
Chris Mayo
1e277444f4 Remove Python 2 thread import 2020-05-16 16:26:34 +01:00
Chris Mayo
dcbddfe045 Remove Python 2 ConfigParser import 2020-05-15 19:37:04 +01:00
Chris Mayo
f8c9faec1b Remove Python 2 cStringIO imports 2020-05-15 19:37:04 +01:00
Chris Mayo
bda9612273 Make html.escape Python 3 only 2020-05-14 20:15:28 +01:00
Chris Mayo
42de609f8e Make urllib imports Python 3 only 2020-05-14 20:15:28 +01:00
Chris Mayo
3c661a83d0 Replace parse_host_port() in checker.proxysupport with url.splitport() 2020-05-14 20:15:28 +01:00
Chris Mayo
c80002437e Update run-time version check 2020-05-13 19:50:19 +01:00
Chris Mayo
08ddf658bc
Merge pull request #366 from cjmayo/userorpwd
Support login forms with user and/or password
2020-05-13 19:37:44 +01:00
Chris Mayo
736c893707
Merge pull request #377 from cjmayo/tidyten3
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
3ace021264 Support login forms with user and/or password 2020-05-13 19:32:25 +01:00
Chris Mayo
44e81d27dd Remove inheriting object
All Python 3 classes are new-style.
2020-05-08 10:45:31 +01:00
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-unicode characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Marius Gedminas
22b0165b72 Make _Logger an abstract base class
The __metaclass__ syntax is a Python-2-ism.  It was replaced with

    class _Logger (object, metaclass=abc.ABCMeta):

in Python 3.  And then Python 3.4 introduced abc.ABC which is an empty
class that has ABCMeta as the metaclass, making it simpler to define
abstract base classes.
2020-04-30 23:09:42 +03:00
Chris Mayo
4d3e5abcfa Remove u string prefixes 2020-04-30 20:11:59 +01:00
anarcat
ab476fa4bf
Merge pull request #364 from cjmayo/parser5
Stop using HTML handlers and improve login form error handling
2020-04-30 09:28:48 -04:00
Chris Mayo
12a948894b Fix space style in linkcheck/htmlutil/linkparse.py 2020-04-29 20:07:00 +01:00
Chris Mayo
9eed070a73 Stop using HTML handlers
LinkFinder is the only remaining HTML handler therefore no need for
htmlsoup.process_soup() as an independent function or TagFinder as a
base class.
2020-04-29 20:07:00 +01:00
Chris Mayo
4ffdbf2406 Replace MetaRobotsFinder using BeautifulSoup.find() 2020-04-29 20:07:00 +01:00
Chris Mayo
a51f02cf66 Improve error handling and debugging for login form 2020-04-27 18:06:29 +01:00
Chris Mayo
9a33c2a659 Make requesting login form password work on Python 3 2020-04-27 18:06:29 +01:00
Chris Mayo
8fc0dcc055 Make matching login form credentials case-sensitive
The keys of the form.data dictionary are case-sensitive and therefore a
KeyError was possible if the configured values are not identical to
the input element name attributes.
2020-04-27 18:06:29 +01:00
Chris Mayo
7a6ef938cc Rename htmlutil.formsearch to htmlutil.loginformsearch
Make it clear that this module has only one specific use.
2020-04-27 18:06:29 +01:00
anarcat
350f8bfef9
Merge pull request #373 from linkchecker/fix-swf-parsing
SWF files are binary data
2020-04-27 09:39:52 -04:00
Marius Gedminas
680783b1ff SWF files are binary data
Should fix #372.
2020-04-27 11:25:37 +03:00
anarcat
183d483074
Merge pull request #365 from cjmayo/tidyten1
Remove use of the future package
2020-04-26 12:02:30 -04:00
Chris Mayo
d189445a8e LinkFinder does not raise StopParse 2020-04-18 20:30:46 +01:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
384e1e196d Remove Python 2 gettext builtin installation 2020-04-15 19:49:16 +01:00
Chris Mayo
a83fbb56c0 Remove from __future__ imports 2020-04-15 19:49:16 +01:00
Chris Mayo
f5e7f3a382 Remove use of the future package
It was providing Python 2 compatibility.
2020-04-15 19:49:16 +01:00
Chris Mayo
0795e3c1b4 Replace Parser class using BeautifulSoup.find_all() 2020-04-10 13:51:09 +01:00
Chris Mayo
eb3cf28baa Remove support for start_end_element() callback
The LinkFinder handler start_end_element() callback does nothing apart
from call start_element().
2020-04-10 13:51:09 +01:00
Chris Mayo
c9f17e92b9 Remove support for end_element() callback 2020-04-10 13:51:09 +01:00
Chris Mayo
48b590cf8b Replace FormFinder using BeautifulSoup.find_all()
FormFinder was the only handler that used an end_element() callback and
was therefore a blocker to moving the Parser class to use
BeautifulSoup.find_all()

FormFinder was a specialised handler used to parse a login form at
the start of a session if the user had configured authentication
credentials.
2020-04-10 13:51:05 +01:00
Chris Mayo
974915cc4f Remove encoding from Parser
Only used by the test and an attribute of the soup object.
2020-04-08 20:03:35 +01:00
Chris Mayo
02e1c389b2 Remove parser flush() and reset()
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
3771dd9136 Use parser.feed_soup() instead of parser.feed()
Markup is not being passed in pieces to the parser, so simplify the
interface and reduce the state further.
2020-04-08 20:03:35 +01:00
Chris Mayo
40f43ae41c Create one function to make soup objects 2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06 Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
16e6fb2919 Fix incorrect character in FormFinder log message 2020-04-07 19:24:34 +01:00
Chris Mayo
00f940d979 Fix FormFinder callbacks for missing element_text
element_text added in:
51a06d8a ("Remove home-cooked htmlparser and use BeautifulSoup",
2019-07-22)
2020-04-07 19:24:34 +01:00
Chris Mayo
fe024fb0c8 Remove unused Parser.debug() method 2020-04-03 19:24:08 +01:00
Chris Mayo
0c5e3bb403 Remove old HtmlParser .gitignore
htmlparse.output was a product of the built-in parser.
2020-04-03 19:24:08 +01:00
Chris Mayo
036b900ffc Remove unused linkcheck.containers classes 2020-04-03 19:24:08 +01:00
Chris Mayo
3ff3d72492 Use BeautifulSoup element attrs directly 2020-04-03 19:24:08 +01:00
Chris Mayo
a7e1e20172 Remove last line and column from Parser
Only used for debug log message and not very useful.
2020-04-03 19:24:08 +01:00
Chris Mayo
28701e291a Remove use of Python 2 unicode() and related u prefixes
Several instances for MS Windows left unchanged.
2020-04-01 19:39:50 +01:00
anarcat
cf4e6bb235
Merge pull request #351 from cjmayo/tagsonly
Remove support for non-Tag elements from Parser
2020-04-01 12:17:18 -04:00
Chris Mayo
ffa6ac457f Remove support for non-Tag elements from Parser
This change is made because the linkchecker handlers only process
Tags.

The test HtmlPrettyPrinter handler is updated to output element text
because its support for non-Tag elements has been removed. This results
in a number of the existing tests still passing.
2020-03-31 20:10:35 +01:00
Chris Mayo
e7c5f353cd Remove unused function linkcheck.fileutil.write_file()
Doesn't appear to have ever been used.

Causes flake8 error:
linkcheck/fileutil.py:45:9: F821 undefined name 'file'
2020-03-31 19:46:31 +01:00
Chris Mayo
504004d4f0 Use ipaddress in network.iputil.is_valid_ip()
ipaddress was introduced in Python 3.3.
2020-03-31 19:46:31 +01:00
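Note on 504004d4f0: one way such a check can be written with the `ipaddress` module. This is a sketch under the assumption that the function only needs a yes/no answer; it is not copied from network/iputil.py.

```python
import ipaddress


def is_valid_ip(text):
    """Return True if text is a valid IPv4 or IPv6 address (sketch)."""
    try:
        # ip_address() accepts both address families and raises
        # ValueError for anything else.
        ipaddress.ip_address(text)
    except ValueError:
        return False
    return True
```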
Chris Mayo
2eb1424703 Replace deprecated plistlib.readPlistFromBytes() in bookmarks.safari
Remove Python 2 code.

plistlib.loads() was added in Python 3.4.
2020-03-31 19:46:31 +01:00
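Note on 2eb1424703: `plistlib.loads()` parses a property list held in a bytes object. The sketch below round-trips a small dictionary; the key `URLString` and the helper `roundtrip` are illustrative assumptions, not the Safari bookmark parser itself.

```python
import plistlib


def roundtrip(data):
    """Serialise data to plist bytes and parse it back with loads()."""
    raw = plistlib.dumps(data)  # bytes, XML format by default
    return plistlib.loads(raw)  # format is detected automatically
```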
Chris Mayo
0ee4414a60 Replace memoized with functools.lru_cache 2020-03-31 19:46:31 +01:00
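Note on 0ee4414a60: a sketch of memoisation with `functools.lru_cache`. The function `slow_lookup` and the `calls` list are hypothetical and exist only to make the caching visible; the cache size used in the project is not stated in this log.

```python
import functools

calls = []  # records every real (uncached) invocation


@functools.lru_cache(maxsize=None)
def slow_lookup(key):
    """Pretend-expensive function; results are cached per argument."""
    calls.append(key)
    return key.upper()
```

Arguments must be hashable, and `slow_lookup.cache_clear()` resets the cache when needed.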
Chris Mayo
1255119ca8 Move HtmlPrinter and HtmlPrettyPrinter into tests 2020-03-30 19:32:30 +01:00
Chris Mayo
ce1d669329 Remove unused functions from linkcheck.httputil
http_persistent() unused since:
4b818cb4 ("Detect more cases to close the connection, and close response
objects", 2006-09-15)

http_keepalive(), get_content_encoding() unused since:
7b34be59 ("Introduce check plugins, use Python requests for http/s
connections, and some code cleanups and improvements.", 2014-03-01)
2020-03-30 19:32:30 +01:00
Chris Mayo
5b66964afa Remove unused .charset from checker classes
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Chris Mayo
f743be57e8 Remove unused functions from linkcheck.HtmlParser
resolve_entities() unused since:
2c000683 ("Remove unused linkcheck.htmlutil.linkname module",
2020-03-30)

set_doctype(), set_encoding() unused since:
51a06d8a ("Remove home-cooked htmlparser and use BeautifulSoup",
2019-07-22)
2020-03-30 19:32:18 +01:00
Chris Mayo
2c000683e1 Remove unused linkcheck.htmlutil.linkname module
Unused since:
d6d48b48 ("html parser: use name instead of peeking", 2019-07-22)
2020-03-30 19:31:11 +01:00
Marius Gedminas
af0f50efa8 Restore support for older BeautifulSoup4 versions 2020-03-30 14:49:56 +03:00
Wes Haggard
dcdc64e878 Turn status code 429 into warning instead of failure 2020-03-25 16:36:08 -07:00
Marius Gedminas
a311ebb97e Fix doctype tests
I don't think linkchecker actually cares about the document type, so I'm
not sure why we're even testing this...
2020-03-23 10:56:57 +02:00
Chris Mayo
5eaad24641 Use HTTP header encoding for decoding 2020-03-22 19:54:37 +00:00
Chris Mayo
f5ae90e824 Parser threading lock no longer required with Beautiful Soup 2020-03-22 19:54:37 +00:00
Chris Mayo
d3d6638973 Actually fix TypeError when checking https link
The test was added but not the fix in:
ecd06776 ("Fix TypeError when checking https link and test", 2019-11-11)

Which is caught by the new test when run on Python 3:
___________________ TestHttps.test_x509_to_dict__________________
[gw14] linux -- Python 3.6.9 /usr/bin/python3.6
tests/checker/test_https.py:72: in test_x509_to_dict
    self.assertEqual(httputil.x509_to_dict(cert)["notAfter"],
linkcheck/httputil.py:47: in x509_to_dict
    parsedtime = asn1_generaltime_to_seconds(notAfter)
linkcheck/httputil.py:68: in asn1_generaltime_to_seconds
    res = datetime.strptime(timestr, timeformat + 'Z')
E   TypeError: strptime() argument 1 must be str, not bytes
2019-11-19 20:06:10 +00:00
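Note on d3d6638973: the traceback shows `strptime()` rejecting bytes. A sketch of one possible fix is to decode before parsing; the function name `parse_asn1_time` and the exact format string are assumptions, not the code in httputil.py.

```python
from datetime import datetime


def parse_asn1_time(value, fmt="%Y%m%d%H%M%SZ"):
    """Parse an ASN.1 GeneralizedTime-style timestamp given as str or bytes."""
    if isinstance(value, bytes):
        # strptime() only accepts str, so decode first; such timestamps
        # are plain ASCII.
        value = value.decode("ascii")
    return datetime.strptime(value, fmt)
```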
Chris Mayo
ec8b6e09f0 Fix XmlTagUrlParser and make Python 3 compatible
URLs within a sitemap file were not being captured.
2019-10-28 19:20:05 +00:00
Marius Gedminas
8bdd402aed
Merge pull request #333 from linkchecker/fix-clamav-on-py3
Fix test_clamav.py on Python 3
2019-10-25 16:16:23 +03:00
Marius Gedminas
5b2b3613ec
Merge pull request #330 from linkchecker/fix-sitemap
Fix sitemap parser
2019-10-25 16:15:55 +03:00
Marius Gedminas
f9766a2049 Python 3: fix bytes vs strings in viruscheck plugin
Socket communication deals with bytes.

There are probably remaining issues with the viruscheck plugin on
Python 3, we just can't see them because the code is not fully covered
with tests.
2019-10-25 14:24:07 +03:00
Chris Mayo
b2e63663f8 Make PdfParser Python 3 compatible
basestring is not available in Python 3. Ensure all URLs are Unicode.

url_data.get_raw_content() is returning bytes.
2019-10-24 19:57:27 +01:00
Marius Gedminas
a1af1e9717 Fix sitemap parser
PyExpat wants bytes on Python 2.  See #323.
2019-10-23 17:23:23 +03:00
Marius Gedminas
938467c3ae
Merge pull request #324 from cjmayo/pdfminer
Add pdfminer to tox.ini and dev-requirements.txt to enable pdf test
2019-10-23 09:47:01 +03:00
Marius Gedminas
db3e25e934
Merge pull request #326 from linkchecker/fix-word-maybe
Fix MS Word parser, hopefully
2019-10-22 18:08:46 +03:00
Marius Gedminas
c6de64978c
Merge pull request #325 from linkchecker/type-error-in-robot-parser
Fix TypeError: string arg required in content_allows_robots()
2019-10-22 18:07:31 +03:00
Marius Gedminas
fa32a89d6b Fix MS Word parser, hopefully
MS Word files are binary data, and get_temp_filename() will write them
to disk using open(..., 'wb'), so we want to pass bytes in there, not
Unicode.

See #323.
2019-10-22 16:39:57 +03:00
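Note on fa32a89d6b: a sketch of why bytes are required when the file is opened in binary mode. `write_temp` is a hypothetical stand-in for `get_temp_filename()`, whose real signature is not shown in this log.

```python
import os
import tempfile


def write_temp(content):
    """Write bytes to a new temporary file and return its path (sketch)."""
    fd, path = tempfile.mkstemp()
    # Binary mode: write() accepts bytes and raises TypeError for str.
    with os.fdopen(fd, "wb") as handle:
        handle.write(content)
    return path
```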
Marius Gedminas
58b0d5aaae Fix TypeError: string arg required in content_allows_robots()
See #323 and #317.
2019-10-22 14:13:45 +03:00
Chris Mayo
949f84d329 PdfParser requires bytes 2019-10-21 20:12:33 +01:00
Chris Mayo
7da64b16f0 Don't add linkcheck_dns directory to sys.path
This code was added in:
efbbb656 ("Remove python-dns conflict by moving the dns module into a custom subdirectory.", 2012-12-07)

Installation of linkcheck_dns stopped with:
0a13fae3 ("remove third party packages and use them as dependency", 2018-01-06)
2019-10-21 19:52:58 +01:00
Marius Gedminas
e274d74be2 Wait for threads to exit after stopping them
This fixes a race condition where the main thread would check if any
internal errors happened and get back a 0 while a worker thread was
still busy printing the internal error message before incrementing the
counter.

Fixes #320.

My experiments show that this adds no perceptible delay to the script
runtime (on Linux).  More specifically, there already is an annoying
perceptible delay of about 1 second, but it's not caused by this change.
2019-10-21 18:23:58 +03:00
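Note on e274d74be2: a sketch of the race and the fix described above. `ErrorCounter`, `worker` and `run_and_count` are hypothetical names; the sleep stands in for a worker that is still printing an error message before it increments the counter.

```python
import threading
import time


class ErrorCounter:
    """Thread-safe counter for internal errors (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def increment(self):
        with self._lock:
            self.count += 1


def worker(counter):
    time.sleep(0.05)  # stands in for printing the error message
    counter.increment()


def run_and_count(num_workers=3):
    """Start the workers, wait for all of them, then read the counter."""
    counter = ErrorCounter()
    threads = [
        threading.Thread(target=worker, args=(counter,))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    # Without these joins the main thread could read the count while
    # workers are still running and see a value that is too low.
    for thread in threads:
        thread.join()
    return counter.count
```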
Marius Gedminas
84dbb5d603 Fix TypeError: string arg required in find_links()
Fixes #317.
2019-10-21 17:47:46 +03:00
Chris Mayo
c7a32d67fe Remove unused code from network subpackage 2019-10-19 10:27:34 +01:00
anarcat
f73ba54a2a
Merge pull request #308 from cjmayo/decode
Decode content when retrieved
2019-10-10 09:46:32 -04:00
anarcat
7cfb1136e9
Merge pull request #313 from cjmayo/titlefinder
Remove unused linkparse.TitleFinder
2019-10-07 11:30:10 -04:00
Chris Mayo
127c2272c4 Remove unused linkparse.TitleFinder
Stopped being used with removal of UrlBase.set_title_from_content() in:

7b34be59 ("Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements.", 2014-03-01)
2019-10-05 19:43:33 +01:00
Chris Mayo
b7ec71d8cc Always use utf-8 encoding when quoting 2019-10-05 19:38:57 +01:00
Chris Mayo
a9f147c347 Update fileutil.pathencode() because paths are now strings 2019-10-05 19:38:57 +01:00
Chris Mayo
5bb4524a63 Update strformat.ascii_safe() because paths are now strings 2019-10-05 19:38:57 +01:00
Chris Mayo
646e138166 Pass encoding when unquoting
Else non-UTF-8 codes are misinterpreted:

>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
2019-10-05 19:38:57 +01:00
Chris Mayo
153e53ba03 Reuse soup object used for detecting encoding in the HTML parser 2019-10-05 19:38:57 +01:00
Chris Mayo
978042a54e Hide Beautiful Soup soupsieve warning
Shown every time linkchecker is run:

/usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The
soupsieve package is not installed. CSS selectors cannot be used.
  'The soupsieve package is not installed. CSS selectors cannot be used.'
2019-10-05 19:38:57 +01:00
Chris Mayo
30df69c158 Improve pretty printed comments 2019-10-05 19:38:57 +01:00
Chris Mayo
607328d5c5 Support Beautiful Soup line numbers 2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf Don't set parser.encoding
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Chris Mayo
5732606c58 Remove urlutil.decode_for_unquote()
Not needed since all content is now being decoded on retrieval.

Added by:
a6643034 ("Python3: decode parts before submitting them to urllib.quote()", 2018-01-05)
2019-10-04 19:37:09 +01:00
Chris Mayo
2776eb5f52 Revert "Python3: fix opening file URLs"
This reverts commit 4c9ec511b5.
2019-10-04 19:37:09 +01:00
Chris Mayo
c6a06d99ac Remove unnecessary unicode() from StatusLogger.writeln() 2019-09-30 20:06:48 +01:00
Petr Dlouhý
6e8da10942 fixes for Python 3: fix markdowncheck
The translate() method of string objects (and Python 2 Unicode objects)
only accepts a single, table argument.
2019-09-30 19:46:24 +01:00
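Note on 6e8da10942: in Python 3, characters to delete are encoded in the table itself rather than passed as a second argument. A sketch using `str.maketrans()`; `strip_chars` is a hypothetical helper and not necessarily the change made in markdowncheck.py.

```python
def strip_chars(text, chars):
    """Remove every character in chars from text using one translation table."""
    # The third argument of str.maketrans() lists characters mapped to None,
    # which translate() then deletes.
    table = str.maketrans("", "", chars)
    return text.translate(table)
```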
Chris Mayo
e01ea0d9f0 Safari bookmark parser requires bytes 2019-09-30 19:46:24 +01:00
Chris Mayo
ad33d359c1 Adapt Opera bookmark parser to work with decoded data 2019-09-30 19:46:24 +01:00
Chris Mayo
9460064084 Use requests to decode the content of login form 2019-09-30 19:46:24 +01:00
Chris Mayo
5fc01455b7 Decode content when retrieved, use bs4 to detect encoding if non-Unicode
UrlBase has been modified as follows:
- the "data" variable now holds bytes
- decoded content is stored in a new variable "text"
- functionality from get_content() has been split out into
  get_raw_content() which returns "data" and download_content() which
  calls read_content() and sets the download related variables.
  This allows for subclasses to do their own decoding and parsers to
  use bytes.
2019-09-30 19:46:24 +01:00
Chris Mayo
0c90c718bf Revert "Python3: fix bytes mark in parser/__init__.py"
This reverts commit aec8243348.
2019-09-30 19:46:24 +01:00
Chris Mayo
53cd9475b5 Replace deprecated cgi.escape
html provided for Python 2 by future
https://python-future.org/compatible_idioms.html#html-escaping-and-entities
2019-09-17 20:25:05 +01:00
anarcat
1590408a65
Merge pull request #306 from cjmayo/python3_49
{python3_49} enable and fix remaining bookmark tests
2019-09-16 15:18:26 -04:00
Petr Dlouhý
eaa7131523 enable and fix remaining bookmark tests
biplist module preferred for reading Safari bookmarks in
bookmarks/safari.py so install it for tox testing.
2019-09-16 20:08:01 +01:00
anarcat
4ccf0fb2d0
Merge pull request #305 from cjmayo/python3_48
{python3_48} Python3: fix displaying help
2019-09-16 10:10:36 -04:00
anarcat
2c7573b3b8
Merge pull request #300 from cjmayo/python3_43
{python3_43} Python3: fix for test_telnet in urlbase.py
2019-09-16 10:08:18 -04:00
anarcat
bec68f237b
Merge pull request #299 from cjmayo/python3_42
{python3_42} fixes for Python 3: fix telneturl
2019-09-16 10:07:55 -04:00
anarcat
27d672c78b
Merge pull request #297 from cjmayo/python3_40
{python3_40} Python3: fixes form checker/__init__.py
2019-09-16 10:06:05 -04:00
anarcat
5a0a02ae74
Merge pull request #294 from cjmayo/python3_39_alt
{python3_39_alt} Python3: fix TypeError in HttpUrl.read_content()
2019-09-16 10:04:23 -04:00
Petr Dlouhý
14e19efe07 Python3: fix displaying help 2019-09-15 19:50:05 +01:00
Petr Dlouhý
c2af88ad2e Python3: fix for test_telnet in urlbase.py 2019-09-15 19:49:26 +01:00
Petr Dlouhý
a2e67af7b4 fixes for Python 3: fix telneturl 2019-09-15 19:49:18 +01:00
Petr Dlouhý
bb542b00e9 Python3: fixes form checker/__init__.py 2019-09-15 19:49:00 +01:00
Chris Mayo
06fdd78f91 Python3: fix TypeError in HttpUrl.read_content()
From test_http_redirect:

  File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content
    line: buf.write(data)
    locals:
      buf = <local> <_io.StringIO object at 0x7f8fe2f45e10>
      buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10>
      data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n'
TypeError: string argument expected, got 'bytes'
2019-09-15 19:42:29 +01:00
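Note on 06fdd78f91: the traceback shows bytes being written to an `io.StringIO`. One way to avoid that error, assuming the chunks are to be kept as bytes, is to collect them in an `io.BytesIO`; `buffer_chunks` is a hypothetical helper and the actual fix in httpurl.py may differ.

```python
import io


def buffer_chunks(chunks):
    """Concatenate an iterable of bytes chunks using a binary buffer."""
    buf = io.BytesIO()  # accepts bytes; io.StringIO would raise TypeError
    for chunk in chunks:
        buf.write(chunk)
    return buf.getvalue()
```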
anarcat
736d2a786d
Merge pull request #293 from cjmayo/python3_37_alt
{python3_37_alt} Python3: fix TypeError when parsing cookie data
2019-09-14 11:51:26 -04:00
anarcat
fe39db4fbf
Merge pull request #287 from cjmayo/python3_36
{python3_36} fixes for Python 3 + Travis test: fix cgi
2019-09-14 11:50:53 -04:00
Chris Mayo
a7b7e31917 Python3: fix TypeError when parsing cookie data
>       fp = BytesIO(strheader)
E       TypeError: a bytes-like object is required, not 'str'

linkcheck/cookies.py:61: TypeError

The email package provides the message_from_string() convenience
function which avoids the need to create a file-like object.
Indeed http.client.HTTPMessage is implemented using email.message.Message.
2019-09-13 20:10:25 +01:00
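Note on a7b7e31917: a sketch of parsing a header block with `email.message_from_string()`, which takes a str directly and so needs no file-like object. The helper `parse_header_block` and the sample header text are illustrative assumptions, not the code in linkcheck/cookies.py.

```python
from email import message_from_string


def parse_header_block(header_text):
    """Parse RFC 822 style headers from a str into a Message object."""
    return message_from_string(header_text)


# Sample input invented for illustration.
message = parse_header_block("Set-Cookie: a=1; Path=/\nSet-Cookie: b=2\n\n")
cookies = message.get_all("Set-Cookie")
```

`get_all()` returns every value for a repeated header, which suits `Set-Cookie`.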
Petr Dlouhý
36465112d0 fixes for Python 3 + Travis test: fix cgi 2019-09-13 19:46:13 +01:00
anarcat
aaa8cb675e
Merge pull request #291 from cjmayo/python3_33_alt
{python3_33_alt} Python3: fix opening file URLs
2019-09-13 10:31:20 -04:00
anarcat
80b62a3e21
Merge pull request #292 from cjmayo/lc_cgi_error
Fix errors caused by logging LCFormError exceptions
2019-09-13 09:12:05 -04:00
anarcat
b0b392f7cc
Merge pull request #282 from cjmayo/python3_31
{python3_31} Python3: fix strformat strline()
2019-09-13 09:11:33 -04:00
Chris Mayo
6dc25547d5 Fix errors caused by logging LCFormError exceptions 2019-09-12 20:13:08 +01:00