Commit graph

6174 commits

Author SHA1 Message Date
Marius Gedminas
a311ebb97e Fix doctype tests
I don't think linkchecker actually cares about the document type, so I'm
not sure why we're even testing this...
2020-03-23 10:56:57 +02:00
Chris Mayo
5eaad24641 Use HTTP header encoding for decoding 2020-03-22 19:54:37 +00:00
Chris Mayo
f5ae90e824 Parser threading lock no longer required with Beautiful Soup 2020-03-22 19:54:37 +00:00
Chris Mayo
74d5c68094 Add new tests for URL quoting 2019-10-05 19:38:57 +01:00
Chris Mayo
b7ec71d8cc Always use utf-8 encoding when quoting 2019-10-05 19:38:57 +01:00
Chris Mayo
a9f147c347 Update fileutil.pathencode() because paths are now strings 2019-10-05 19:38:57 +01:00
Chris Mayo
5bb4524a63 Update strformat.ascii_safe() because paths are now strings 2019-10-05 19:38:57 +01:00
Chris Mayo
646e138166 Pass encoding when unquoting
Else non-UTF-8 codes are misinterpreted:

>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
2019-10-05 19:38:57 +01:00
Chris Mayo
153e53ba03 Reuse soup object used for detecting encoding in the HTML parser 2019-10-05 19:38:57 +01:00
Chris Mayo
978042a54e Hide Beautiful Soup soupsieve warning
Shown every time linkchecker is run:

/usr/lib/python3.7/site-packages/bs4/element.py:16: UserWarning: The
soupsieve package is not installed. CSS selectors cannot be used.
  'The soupsieve package is not installed. CSS selectors cannot be used.'
2019-10-05 19:38:57 +01:00
Chris Mayo
30df69c158 Improve pretty printed comments 2019-10-05 19:38:57 +01:00
Chris Mayo
607328d5c5 Support Beautiful Soup line numbers 2019-10-05 19:38:57 +01:00
Chris Mayo
e46fb7fe9c Supports Python 3 Only
Needs miniboa >= 1.0.8 for telnet test on Python 3.7.
Test with older Beautiful Soup without line number support on
Python 3.5.

Resolve tox deprecation warning:
  Matching undeclared envs is deprecated. Be sure all the envs that Tox
  should run are declared in the tox config.
2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf Don't set parser.encoding
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Petr Dlouhý
69d426b36f fix parser encoding tests after change of parser
UnicodeDammit input has to be non-unicode to trigger character set
detection.
2019-07-22 19:59:37 +01:00
Petr Dlouhý
b5111453d8 change test_parse encoding to UTF-8 2019-07-22 19:59:37 +01:00
Petr Dlouhý
2c3c794e52 fix http test after parser change 2019-07-22 19:59:37 +01:00
Petr Dlouhý
0089349760 fix parser tests after parser change 2019-07-22 19:59:37 +01:00
Petr Dlouhý
d6d48b4814 html parser: use name instead of peeking 2019-07-22 19:59:37 +01:00
Petr Dlouhý
51a06d8a1e Remove home-cooked htmlparser and use BeautifulSoup 2019-07-22 19:59:37 +01:00
Petr Dlouhý
d1844a526e add charset tests 2019-07-22 19:59:37 +01:00
Petr Dlouhý
2daf685633 Python3: fix few htmllib problems 2018-01-05 22:48:46 +01:00
Marius Gedminas
205ceb6805
Merge pull request #344 from hroncok/beautifulsoup4-requirement
Require beautifulsoup4 instead of bs4
2020-02-06 12:52:20 +02:00
Miro Hrončok
ff5ebbae69 Require beautifulsoup4 instead of bs4
bs4 is a dummy package managed by the developer of Beautiful Soup to prevent
name squatting. The official name of PyPI’s Beautiful Soup Python package is
beautifulsoup4. The bs4 package ensures that if you type pip install bs4 by
mistake you will end up with Beautiful Soup.

However, for requirements, it's cleaner to use the proper name.
For downstream packaging in Fedora, this avoids the need of packaging
the dummy package.
2020-02-06 10:05:13 +01:00
anarcat
e37dab8a4b
Merge pull request #339 from cjmayo/notafter
Actually fix TypeError when checking https link
2019-11-21 10:33:27 -05:00
Chris Mayo
d3d6638973 Actually fix TypeError when checking https link
The test was added but not the fix in:
ecd06776 ("Fix TypeError when checking https link and test", 2019-11-11)

Which is caught by the new test when run on Python 3:
___________________ TestHttps.test_x509_to_dict__________________
[gw14] linux -- Python 3.6.9 /usr/bin/python3.6
tests/checker/test_https.py:72: in test_x509_to_dict
    self.assertEqual(httputil.x509_to_dict(cert)["notAfter"],
linkcheck/httputil.py:47: in x509_to_dict
    parsedtime = asn1_generaltime_to_seconds(notAfter)
linkcheck/httputil.py:68: in asn1_generaltime_to_seconds
    res = datetime.strptime(timestr, timeformat + 'Z')
E   TypeError: strptime() argument 1 must be str, not bytes
2019-11-19 20:06:10 +00:00
anarcat
c92ab72676
Merge pull request #338 from cjmayo/https
Enable https checking using a test server
2019-11-14 09:38:54 -05:00
Chris Mayo
ecd06776ab Fix TypeError when checking https link and test
File "/usr/lib/python3.7/site-packages/linkcheck/httputil.py", line 68, in asn1_generaltime_to_seconds
    line: res = datetime.strptime(timestr, timeformat + 'Z')
    locals:
      res = <local> None
      datetime = <global> <class 'datetime.datetime'>
      datetime.strptime = <global> <built-in method strptime of type object at 0x7fa39064dda0>
      timestr = <local> b'20191106202117Z'
      timeformat = <local> '%Y%m%d%H%M%S'
TypeError: strptime() argument 1 must be str, not bytes

pyOpenSSL OpenSSL.crypto.X509.get_notAfter() returns bytes:
https://www.pyopenssl.org/en/stable/api/crypto.html#OpenSSL.crypto.X509.get_notAfter
2019-11-11 20:12:25 +00:00
Chris Mayo
dee4be4b1d Enable https checking using a test server
Verification has to be turned off because we are using a
self-signed certificate.
2019-11-11 20:12:25 +00:00
anarcat
5308ec5204
Merge pull request #336 from cjmayo/logdiff
Improve test failure diff
2019-10-29 16:20:26 -04:00
Chris Mayo
2f16152dc8 Improve test failure diff
Some url lines were missing a url prefix while others had a double url
prefix. diff was reporting more url lines as changed than actually had.
Improve formatting by removing newlines from control lines and adding
headings.

Before:

E   AssertionError: http://localhost:46031/tests/checker/data/sitemap.xml
E   ---
E
E   +++
E
E   @@ -1,4 +1,8 @@
E
E   -url http://localhost:46031/tests/checker/data/sitemap.xml
E   +http://www.example.com/
E   +cache key http://www.example.com/
E   +real url http://www.example.com/
E   +valid
E   +url url http://localhost:46031/tests/checker/data/sitemap.xml
E    cache key http://localhost:46031/tests/checker/data/sitemap.xml
E    real url http://localhost:46031/tests/checker/data/sitemap.xml
E    valid

After:

E   AssertionError: http://localhost:44021/tests/checker/data/sitemap.xml
E   --- expected
E   +++ result
E   @@ -2,3 +2,7 @@
E    cache key http://localhost:44021/tests/checker/data/sitemap.xml
E    real url http://localhost:44021/tests/checker/data/sitemap.xml
E    valid
E   +url http://www.example.com/
E   +cache key http://www.example.com/
E   +real url http://www.example.com/
E   +valid
2019-10-29 20:03:08 +00:00
Marius Gedminas
c294a4e6c1
Merge pull request #335 from cjmayo/sitemap
Fix XmlTagUrlParser and make Python 3 compatible
2019-10-29 15:50:49 +02:00
Chris Mayo
ec8b6e09f0 Fix XmlTagUrlParser and make Python 3 compatible
URLs within a sitemap file were not being captured.
2019-10-28 19:20:05 +00:00
Marius Gedminas
8bdd402aed
Merge pull request #333 from linkchecker/fix-clamav-on-py3
Fix test_clamav.py on Python 3
2019-10-25 16:16:23 +03:00
Marius Gedminas
5b2b3613ec
Merge pull request #330 from linkchecker/fix-sitemap
Fix sitemap parser
2019-10-25 16:15:55 +03:00
anarcat
6dcc9dbf9d
Merge pull request #332 from cjmayo/py3pdf
Make PdfParser Python 3 compatible
2019-10-25 08:38:59 -04:00
Marius Gedminas
f9766a2049 Python 3: fix bytes vs strings in viruscheck plugin
Socket communication deals with bytes.

There are probably remaining issues with the viruscheck plugin on
Python 3, we just can't see them because the code is not fully covered
with tests.
2019-10-25 14:24:07 +03:00
Marius Gedminas
65f861901c Fix all Python 3 tox environments
Old pdfminer supports Python 2 only, new pdfminer supports Python 3
only.
2019-10-25 14:20:31 +03:00
Chris Mayo
b2e63663f8 Make PdfParser Python 3 compatible
basestring is not available in Python 3. Ensure all URLs are Unicode.

url_data.get_raw_content() is returning bytes.
2019-10-24 19:57:27 +01:00
Marius Gedminas
011f6c147e
Merge pull request #331 from linkchecker/explain-skips
Explain why these tests are being skipped
2019-10-23 17:59:55 +03:00
Marius Gedminas
606ece0308 Explain why these tests are being skipped
pytest output before this change:

    SKIPPED [3] tests/__init__.py:217: condition: True
    SKIPPED [1] tests/checker/test_news.py:63: condition: True
    SKIPPED [1] tests/checker/test_news.py:41: condition: True
    SKIPPED [1] tests/checker/test_news.py:116: condition: True
    SKIPPED [1] tests/checker/test_news.py:75: condition: True

After:

    SKIPPED [3] tests/__init__.py: disabled for now until some stable news server comes up
    SKIPPED [4] tests/checker/test_news.py: disabled for now until some stable news server comes up
2019-10-23 17:35:31 +03:00
Marius Gedminas
87b504785c Add a regression test for the sitemap parser 2019-10-23 17:30:10 +03:00
Marius Gedminas
a1af1e9717 Fix sitemap parser
PyExpat wants bytes on Python 2.  See #323.
2019-10-23 17:23:23 +03:00
Marius Gedminas
f46151dbf8
Merge pull request #318 from tkfu/docs/fix-install-instructions
Add instructions to install current release tag from git via pip
2019-10-23 09:47:25 +03:00
Marius Gedminas
938467c3ae
Merge pull request #324 from cjmayo/pdfminer
Add pdfminer to tox.ini and dev-requirements.txt to enable pdf test
2019-10-23 09:47:01 +03:00
Marius Gedminas
db3e25e934
Merge pull request #326 from linkchecker/fix-word-maybe
Fix MS Word parser, hopefully
2019-10-22 18:08:46 +03:00
Marius Gedminas
c6de64978c
Merge pull request #325 from linkchecker/type-error-in-robot-parser
Fix TypeError: string arg required in content_allows_robots()
2019-10-22 18:07:31 +03:00
Marius Gedminas
2a748a3f12
Merge pull request #328 from linkchecker/enable-clamav-tests
Enable ClamAV integration tests on Travis CI
2019-10-22 18:06:24 +03:00
anarcat
928594b194
Merge pull request #316 from cjmayo/nodnspath
Don't add linkcheck_dns directory to sys.path
2019-10-22 10:43:23 -04:00
Marius Gedminas
f283894f86 Sudo is needed to stop/start system services 2019-10-22 17:21:53 +03:00