Chris Mayo
3771dd9136
Use parser.feed_soup() instead of parser.feed()
...
Markup is not being passed in pieces to the parser, so simplify the
interface and reduce the state further.
2020-04-08 20:03:35 +01:00
Chris Mayo
40f43ae41c
Create one function to make soup objects
2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06
Replace Parser lineno() and column() methods
...
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
16e6fb2919
Fix incorrect character in FormFinder log message
2020-04-07 19:24:34 +01:00
Chris Mayo
00f940d979
Fix FormFinder callbacks for missing element_text
...
element_text added in:
51a06d8a ("Remove home-cooked htmlparser and use BeautifulSoup",
2019-07-22)
2020-04-07 19:24:34 +01:00
Chris Mayo
fe024fb0c8
Remove unused Parser.debug() method
2020-04-03 19:24:08 +01:00
Chris Mayo
0c5e3bb403
Remove old HtmlParser .gitignore
...
htmlparse.output was a product of the built-in parser.
2020-04-03 19:24:08 +01:00
Chris Mayo
036b900ffc
Remove unused linkcheck.containers classes
2020-04-03 19:24:08 +01:00
Chris Mayo
3ff3d72492
Use BeautifulSoup element attrs directly
2020-04-03 19:24:08 +01:00
Chris Mayo
a7e1e20172
Remove last line and column from Parser
...
Only used for debug log message and not very useful.
2020-04-03 19:24:08 +01:00
Chris Mayo
28701e291a
Remove use of Python 2 unicode() and related u prefixes
...
Several instances for MS Windows left unchanged.
2020-04-01 19:39:50 +01:00
anarcat
cf4e6bb235
Merge pull request #351 from cjmayo/tagsonly
...
Remove support for non-Tag elements from Parser
2020-04-01 12:17:18 -04:00
Chris Mayo
ffa6ac457f
Remove support for non-Tag elements from Parser
...
This change is made because the linkchecker handlers only process
Tags.
The test HtmlPrettyPrinter handler is updated to output element text
because its support for non-Tag elements has been removed. This results
in a number of the existing tests still passing.
2020-03-31 20:10:35 +01:00
Chris Mayo
e7c5f353cd
Remove unused function linkcheck.fileutil.write_file()
...
Doesn't appear to have ever been used.
Causes flake8 error:
linkcheck/fileutil.py:45:9: F821 undefined name 'file'
2020-03-31 19:46:31 +01:00
Chris Mayo
504004d4f0
Use ipaddress in network.iputil.is_valid_ip()
...
ipaddress was introduced in Python 3.3.
2020-03-31 19:46:31 +01:00
Chris Mayo
2eb1424703
Replace deprecated plistlib.readPlistFromBytes() in bookmarks.safari
...
Remove Python 2 code.
plistlib.loads() was added in Python 3.4.
2020-03-31 19:46:31 +01:00
Chris Mayo
0ee4414a60
Replace memoized with functools.lru_cache
2020-03-31 19:46:31 +01:00
Chris Mayo
1255119ca8
Move HtmlPrinter and HtmlPrettyPrinter into tests
2020-03-30 19:32:30 +01:00
Chris Mayo
ce1d669329
Remove unused functions from linkcheck.httputil
...
http_persistent() unused since:
4b818cb4 ("Detect more cases to close the connection, and close response
objects", 2006-09-15)
http_keepalive(), get_content_encoding() unused since:
7b34be59 ("Introduce check plugins, use Python requests for http/s
connections, and some code cleanups and improvements.", 2014-03-01)
2020-03-30 19:32:30 +01:00
Chris Mayo
5b66964afa
Remove unused .charset from checker classes
...
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Chris Mayo
f743be57e8
Remove unused functions from linkcheck.HtmlParser
...
resolve_entities() unused since:
2c000683 ("Remove unused linkcheck.htmlutil.linkname module",
2020-03-30)
set_doctype(), set_encoding() unused since:
51a06d8a ("Remove home-cooked htmlparser and use BeautifulSoup",
2019-07-22)
2020-03-30 19:32:18 +01:00
Chris Mayo
2c000683e1
Remove unused linkcheck.htmlutil.linkname module
...
Unused since:
d6d48b48 ("html parser: use name instead of peeking", 2019-07-22)
2020-03-30 19:31:11 +01:00
Marius Gedminas
af0f50efa8
Restore support for older BeautifulSoup4 versions
2020-03-30 14:49:56 +03:00
Marius Gedminas
a311ebb97e
Fix doctype tests
...
I don't think linkchecker actually cares about the document type, so I'm
not sure why we're even testing this...
2020-03-23 10:56:57 +02:00
Chris Mayo
5eaad24641
Use HTTP header encoding for decoding
2020-03-22 19:54:37 +00:00
Chris Mayo
f5ae90e824
Parser threading lock no longer required with Beautiful Soup
2020-03-22 19:54:37 +00:00
Chris Mayo
b7ec71d8cc
Always use utf-8 encoding when quoting
2019-10-05 19:38:57 +01:00
Chris Mayo
a9f147c347
Update fileutil.pathencode() because paths are now strings
2019-10-05 19:38:57 +01:00
Chris Mayo
5bb4524a63
Update strformat.ascii_safe() because paths are now strings
2019-10-05 19:38:57 +01:00
Chris Mayo
646e138166
Pass encoding when unquoting
...
Else non-UTF-8 codes are misinterpreted:
>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
2019-10-05 19:38:57 +01:00
Chris Mayo
153e53ba03
Reuse soup object used for detecting encoding in the HTML parser
2019-10-05 19:38:57 +01:00
Chris Mayo
978042a54e
Hide Beautiful Soup soupsieve warning
...
Shown every time linkchecker is run:
/usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The
soupsieve package is not installed. CSS selectors cannot be used.
'The soupsieve package is not installed. CSS selectors cannot be used.'
2019-10-05 19:38:57 +01:00
Chris Mayo
30df69c158
Improve pretty printed comments
2019-10-05 19:38:57 +01:00
Chris Mayo
607328d5c5
Support Beautiful Soup line numbers
2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf
Don't set parser.encoding
...
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Petr Dlouhý
b5111453d8
change test_parse encoding to UTF-8
2019-07-22 19:59:37 +01:00
Petr Dlouhý
d6d48b4814
html parser: use name instead of peeking
2019-07-22 19:59:37 +01:00
Petr Dlouhý
51a06d8a1e
Remove home-cooked htmlparser and use BeautifulSoup
2019-07-22 19:59:37 +01:00
Petr Dlouhý
2daf685633
Python3: fix few htmllib problems
2018-01-05 22:48:46 +01:00
Chris Mayo
d3d6638973
Actually fix TypeError when checking https link
...
The test was added but not the fix in:
ecd06776 ("Fix TypeError when checking https link and test", 2019-11-11)
Which is caught by the new test when run on Python 3:
___________________ TestHttps.test_x509_to_dict__________________
[gw14] linux -- Python 3.6.9 /usr/bin/python3.6
tests/checker/test_https.py:72: in test_x509_to_dict
self.assertEqual(httputil.x509_to_dict(cert)["notAfter"],
linkcheck/httputil.py:47: in x509_to_dict
parsedtime = asn1_generaltime_to_seconds(notAfter)
linkcheck/httputil.py:68: in asn1_generaltime_to_seconds
res = datetime.strptime(timestr, timeformat + 'Z')
E TypeError: strptime() argument 1 must be str, not bytes
2019-11-19 20:06:10 +00:00
Chris Mayo
ec8b6e09f0
Fix XmlTagUrlParser and make Python 3 compatible
...
URLs within a sitemap file were not being captured.
2019-10-28 19:20:05 +00:00
Marius Gedminas
8bdd402aed
Merge pull request #333 from linkchecker/fix-clamav-on-py3
...
Fix test_clamav.py on Python 3
2019-10-25 16:16:23 +03:00
Marius Gedminas
5b2b3613ec
Merge pull request #330 from linkchecker/fix-sitemap
...
Fix sitemap parser
2019-10-25 16:15:55 +03:00
Marius Gedminas
f9766a2049
Python 3: fix bytes vs strings in viruscheck plugin
...
Socket communication deals with bytes.
There are probably remaining issues with the viruscheck plugin on
Python 3, we just can't see them because the code is not fully covered
with tests.
2019-10-25 14:24:07 +03:00
Chris Mayo
b2e63663f8
Make PdfParser Python 3 compatible
...
basestring is not available in Python 3. Ensure all URLs are Unicode.
url_data.get_raw_content() is returning bytes.
2019-10-24 19:57:27 +01:00
Marius Gedminas
a1af1e9717
Fix sitemap parser
...
PyExpat wants bytes on Python 2. See #323 .
2019-10-23 17:23:23 +03:00
Marius Gedminas
938467c3ae
Merge pull request #324 from cjmayo/pdfminer
...
Add pdfminer to tox.ini and dev-requirements.txt to enable pdf test
2019-10-23 09:47:01 +03:00
Marius Gedminas
db3e25e934
Merge pull request #326 from linkchecker/fix-word-maybe
...
Fix MS Word parser, hopefully
2019-10-22 18:08:46 +03:00
Marius Gedminas
c6de64978c
Merge pull request #325 from linkchecker/type-error-in-robot-parser
...
Fix TypeError: string arg required in content_allows_robots()
2019-10-22 18:07:31 +03:00
Marius Gedminas
fa32a89d6b
Fix MS Word parser, hopefully
...
MS Word files are binary data, and get_temp_filename() will write them
to disk using open(..., 'wb'), so we want to pass bytes in there, not
Unicode.
See #323 .
2019-10-22 16:39:57 +03:00