Chris Mayo
0795e3c1b4
Replace Parser class using BeautifulSoup.find_all()
2020-04-10 13:51:09 +01:00
Chris Mayo
eb3cf28baa
Remove support for start_end_element() callback
...
The LinkFinder handler start_end_element() callback does nothing apart
from call start_element().
2020-04-10 13:51:09 +01:00
Chris Mayo
c9f17e92b9
Remove support for end_element() callback
2020-04-10 13:51:09 +01:00
Chris Mayo
48b590cf8b
Replace FormFinder using BeautifulSoup.find_all()
...
FormFinder was the only handler that used an end_element() callback and
was therefore a blocker to moving the Parser class to use
BeautifulSoup.find_all()
FormFinder was a specialised handler used to parse a login form at
the start of a session if the user had configured authentication
credentials.
2020-04-10 13:51:05 +01:00
Chris Mayo
974915cc4f
Remove encoding from Parser
...
Only used by the test and an attribute of the soup object.
2020-04-08 20:03:35 +01:00
Chris Mayo
02e1c389b2
Remove parser flush() and reset()
...
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
3771dd9136
Use parser.feed_soup() instead of parser.feed()
...
Markup is not being passed in pieces to the parser, so simplify the
interface and reduce the state further.
2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06
Replace Parser lineno() and column() methods
...
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
514210199d
Add tests for search_form
2020-04-07 19:24:34 +01:00
Chris Mayo
036b900ffc
Remove unused linkcheck.containers classes
2020-04-03 19:24:08 +01:00
Chris Mayo
3ff3d72492
Use BeautifulSoup element attrs directly
2020-04-03 19:24:08 +01:00
Chris Mayo
28701e291a
Remove use of Python 2 unicode() and related u prefixes
...
Several instances for MS Windows left unchanged.
2020-04-01 19:39:50 +01:00
anarcat
cf4e6bb235
Merge pull request #351 from cjmayo/tagsonly
...
Remove support for non-Tag elements from Parser
2020-04-01 12:17:18 -04:00
Chris Mayo
9fc651e82b
Remove Python 2 compatibility from parser tests
2020-03-31 20:10:35 +01:00
Chris Mayo
ffa6ac457f
Remove support for non-Tag elements from Parser
...
This change is made because the linkchecker handlers only process
Tags.
The test HtmlPrettyPrinter handler is updated to output element text
because its support for non-Tag elements has been removed. This results
in a number of the existing tests still passing.
2020-03-31 20:10:35 +01:00
Chris Mayo
0ee4414a60
Replace memoized with functools.lru_cache
2020-03-31 19:46:31 +01:00
Chris Mayo
1255119ca8
Move HtmlPrinter and HtmlPrettyPrinter into tests
2020-03-30 19:32:30 +01:00
Chris Mayo
f743be57e8
Remove unused functions from linkcheck.HtmlParser
...
resolve_entities() unused since:
2c000683 ("Remove unused linkcheck.htmlutil.linkname module",
2020-03-30)
set_doctype(), set_encoding() unused since:
51a06d8a ("Remove home-cooked htmlparser and use BeautifulSoup",
2019-07-22)
2020-03-30 19:32:18 +01:00
Chris Mayo
2c000683e1
Remove unused linkcheck.htmlutil.linkname module
...
Unused since:
d6d48b48 ("html parser: use name instead of peeking", 2019-07-22)
2020-03-30 19:31:11 +01:00
Chris Mayo
74d5c68094
Add new tests for URL quoting
2019-10-05 19:38:57 +01:00
Chris Mayo
b7ec71d8cc
Always use utf-8 encoding when quoting
2019-10-05 19:38:57 +01:00
Chris Mayo
5bb4524a63
Update strformat.ascii_safe() because paths are now strings
2019-10-05 19:38:57 +01:00
Chris Mayo
646e138166
Pass encoding when unquoting
...
Else non-UTF-8 codes are misinterpreted:
>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
2019-10-05 19:38:57 +01:00
Chris Mayo
30df69c158
Improve pretty printed comments
2019-10-05 19:38:57 +01:00
Chris Mayo
607328d5c5
Support Beautiful Soup line numbers
2019-10-05 19:38:57 +01:00
Petr Dlouhý
69d426b36f
fix parser encoding tests after change of parser
...
UnicodeDammit input has to be non-unicode to trigger character set
detection.
2019-07-22 19:59:37 +01:00
Petr Dlouhý
b5111453d8
change test_parse encoding to UTF-8
2019-07-22 19:59:37 +01:00
Petr Dlouhý
2c3c794e52
fix http test after parser change
2019-07-22 19:59:37 +01:00
Petr Dlouhý
0089349760
fix parser tests after parser change
2019-07-22 19:59:37 +01:00
Petr Dlouhý
d6d48b4814
html parser: use name instead of peeking
2019-07-22 19:59:37 +01:00
Petr Dlouhý
d1844a526e
add charset tests
2019-07-22 19:59:37 +01:00
Chris Mayo
ecd06776ab
Fix TypeError when checking https link and test
...
File "/usr/lib/python3.7/site-packages/linkcheck/httputil.py", line 68, in asn1_generaltime_to_seconds
line: res = datetime.strptime(timestr, timeformat + 'Z')
locals:
res = <local> None
datetime = <global> <class 'datetime.datetime'>
datetime.strptime = <global> <built-in method strptime of type object at 0x7fa39064dda0>
timestr = <local> b'20191106202117Z'
timeformat = <local> '%Y%m%d%H%M%S'
TypeError: strptime() argument 1 must be str, not bytes
pyOpenSSL OpenSSL.crypto.X509.get_notAfter() returns bytes:
https://www.pyopenssl.org/en/stable/api/crypto.html#OpenSSL.crypto.X509.get_notAfter
2019-11-11 20:12:25 +00:00
Chris Mayo
dee4be4b1d
Enable https checking using a test server
...
Verification has to be turned off because we are using a
self-signed certificate.
2019-11-11 20:12:25 +00:00
Chris Mayo
2f16152dc8
Improve test failure diff
...
Some url lines were missing a url prefix while others had a double url
prefix. diff was reporting more url lines as changed than actually had.
Improve formatting by removing newlines from control lines and adding
headings.
Before:
E AssertionError: http://localhost:46031/tests/checker/data/sitemap.xml
E ---
E
E +++
E
E @@ -1,4 +1,8 @@
E
E -url http://localhost:46031/tests/checker/data/sitemap.xml
E +http://www.example.com/
E +cache key http://www.example.com/
E +real url http://www.example.com/
E +valid
E +url url http://localhost:46031/tests/checker/data/sitemap.xml
E cache key http://localhost:46031/tests/checker/data/sitemap.xml
E real url http://localhost:46031/tests/checker/data/sitemap.xml
E valid
After:
E AssertionError: http://localhost:44021/tests/checker/data/sitemap.xml
E --- expected
E +++ result
E @@ -2,3 +2,7 @@
E cache key http://localhost:44021/tests/checker/data/sitemap.xml
E real url http://localhost:44021/tests/checker/data/sitemap.xml
E valid
E +url http://www.example.com/
E +cache key http://www.example.com/
E +real url http://www.example.com/
E +valid
2019-10-29 20:03:08 +00:00
Chris Mayo
ec8b6e09f0
Fix XmlTagUrlParser and make Python 3 compatible
...
URLs within a sitemap file were not being captured.
2019-10-28 19:20:05 +00:00
Marius Gedminas
8bdd402aed
Merge pull request #333 from linkchecker/fix-clamav-on-py3
...
Fix test_clamav.py on Python 3
2019-10-25 16:16:23 +03:00
Marius Gedminas
5b2b3613ec
Merge pull request #330 from linkchecker/fix-sitemap
...
Fix sitemap parser
2019-10-25 16:15:55 +03:00
Marius Gedminas
f9766a2049
Python 3: fix bytes vs strings in viruscheck plugin
...
Socket communication deals with bytes.
There are probably remaining issues with the viruscheck plugin on
Python 3, we just can't see them because the code is not fully covered
with tests.
2019-10-25 14:24:07 +03:00
Marius Gedminas
606ece0308
Explain why these tests are being skipped
...
pytest output before this change:
SKIPPED [3] tests/__init__.py:217: condition: True
SKIPPED [1] tests/checker/test_news.py:63: condition: True
SKIPPED [1] tests/checker/test_news.py:41: condition: True
SKIPPED [1] tests/checker/test_news.py:116: condition: True
SKIPPED [1] tests/checker/test_news.py:75: condition: True
After:
SKIPPED [3] tests/__init__.py: disabled for now until some stable news server comes up
SKIPPED [4] tests/checker/test_news.py: disabled for now until some stable news server comes up
2019-10-23 17:35:31 +03:00
Marius Gedminas
87b504785c
Add a regression test for the sitemap parser
2019-10-23 17:30:10 +03:00
Marius Gedminas
c6de64978c
Merge pull request #325 from linkchecker/type-error-in-robot-parser
...
Fix TypeError: string arg required in content_allows_robots()
2019-10-22 18:07:31 +03:00
Marius Gedminas
7e94e542b3
Enable clamav integration tests on Travis CI
2019-10-22 17:04:09 +03:00
Marius Gedminas
58b0d5aaae
Fix TypeError: string arg required in content_allows_robots()
...
See #323 an #317 .
2019-10-22 14:13:45 +03:00
Marius Gedminas
6a9ab5ae44
Add a failing test
2019-10-22 14:13:45 +03:00
Marius Gedminas
84dbb5d603
Fix TypeError: string arg required in find_links()
...
Fixes #317 .
2019-10-21 17:47:46 +03:00
Marius Gedminas
a4967fe92c
Add a regression test for issue #317
...
The important bit was making the `file_test` helper not ignore internal
errors.
2019-10-21 17:45:18 +03:00
Chris Mayo
c7a32d67fe
Remove unused code from network subpackage
2019-10-19 10:27:34 +01:00
anarcat
bae4282c92
Merge pull request #307 from cjmayo/cgi_escape
...
Replace deprecated cgi.escape
2019-09-18 10:16:58 -04:00
Chris Mayo
53cd9475b5
Replace deprecated cgi.escape
...
html provided for Python 2 by future
https://python-future.org/compatible_idioms.html#html-escaping-and-entities
2019-09-17 20:25:05 +01:00
Petr Dlouhý
1b41df4af3
Python3: fix test error message
2019-09-17 20:20:46 +01:00