Commit graph

6142 commits

Author SHA1 Message Date
Chris Mayo
ec8b6e09f0 Fix XmlTagUrlParser and make Python 3 compatible
URLs within a sitemap file were not being captured.
2019-10-28 19:20:05 +00:00
Marius Gedminas
8bdd402aed
Merge pull request #333 from linkchecker/fix-clamav-on-py3
Fix test_clamav.py on Python 3
2019-10-25 16:16:23 +03:00
Marius Gedminas
5b2b3613ec
Merge pull request #330 from linkchecker/fix-sitemap
Fix sitemap parser
2019-10-25 16:15:55 +03:00
anarcat
6dcc9dbf9d
Merge pull request #332 from cjmayo/py3pdf
Make PdfParser Python 3 compatible
2019-10-25 08:38:59 -04:00
Marius Gedminas
f9766a2049 Python 3: fix bytes vs strings in viruscheck plugin
Socket communication deals with bytes.

There are probably remaining issues with the viruscheck plugin on
Python 3, we just can't see them because the code is not fully covered
with tests.
2019-10-25 14:24:07 +03:00
Marius Gedminas
65f861901c Fix all Python 3 tox environments
Old pdfminer supports Python 2 only, new pdfminer supports Python 3
only.
2019-10-25 14:20:31 +03:00
Chris Mayo
b2e63663f8 Make PdfParser Python 3 compatible
basestring is not available in Python 3. Ensure all URLs are Unicode.

url_data.get_raw_content() is returning bytes.
2019-10-24 19:57:27 +01:00
Marius Gedminas
011f6c147e
Merge pull request #331 from linkchecker/explain-skips
Explain why these tests are being skipped
2019-10-23 17:59:55 +03:00
Marius Gedminas
606ece0308 Explain why these tests are being skipped
pytest output before this change:

    SKIPPED [3] tests/__init__.py:217: condition: True
    SKIPPED [1] tests/checker/test_news.py:63: condition: True
    SKIPPED [1] tests/checker/test_news.py:41: condition: True
    SKIPPED [1] tests/checker/test_news.py:116: condition: True
    SKIPPED [1] tests/checker/test_news.py:75: condition: True

After:

    SKIPPED [3] tests/__init__.py: disabled for now until some stable news server comes up
    SKIPPED [4] tests/checker/test_news.py: disabled for now until some stable news server comes up
2019-10-23 17:35:31 +03:00
Marius Gedminas
87b504785c Add a regression test for the sitemap parser 2019-10-23 17:30:10 +03:00
Marius Gedminas
a1af1e9717 Fix sitemap parser
PyExpat wants bytes on Python 2.  See #323.
2019-10-23 17:23:23 +03:00
Marius Gedminas
f46151dbf8
Merge pull request #318 from tkfu/docs/fix-install-instructions
Add instructions to install current release tag from git via pip
2019-10-23 09:47:25 +03:00
Marius Gedminas
938467c3ae
Merge pull request #324 from cjmayo/pdfminer
Add pdfminer to tox.ini and dev-requirements.txt to enable pdf test
2019-10-23 09:47:01 +03:00
Marius Gedminas
db3e25e934
Merge pull request #326 from linkchecker/fix-word-maybe
Fix MS Word parser, hopefully
2019-10-22 18:08:46 +03:00
Marius Gedminas
c6de64978c
Merge pull request #325 from linkchecker/type-error-in-robot-parser
Fix TypeError: string arg required in content_allows_robots()
2019-10-22 18:07:31 +03:00
Marius Gedminas
2a748a3f12
Merge pull request #328 from linkchecker/enable-clamav-tests
Enable ClamAV integration tests on Travis CI
2019-10-22 18:06:24 +03:00
anarcat
928594b194
Merge pull request #316 from cjmayo/nodnspath
Don't add linkcheck_dns directory to sys.path
2019-10-22 10:43:23 -04:00
Marius Gedminas
f283894f86 Sudo is needed to stop/start system services 2019-10-22 17:21:53 +03:00
Marius Gedminas
746b66e91e Update the clamav database 2019-10-22 17:17:14 +03:00
Marius Gedminas
2251b23df5 Wait for clamav-daemon to start up 2019-10-22 17:08:25 +03:00
Marius Gedminas
7e94e542b3 Enable clamav integration tests on Travis CI 2019-10-22 17:04:09 +03:00
Marius Gedminas
fa32a89d6b Fix MS Word parser, hopefully
MS Word files are binary data, and get_temp_filename() will write them
to disk using open(..., 'wb'), so we want to pass bytes in there, not
Unicode.

See #323.
2019-10-22 16:39:57 +03:00
Marius Gedminas
58b0d5aaae Fix TypeError: string arg required in content_allows_robots()
See #323 an #317.
2019-10-22 14:13:45 +03:00
Marius Gedminas
6a9ab5ae44 Add a failing test 2019-10-22 14:13:45 +03:00
Chris Mayo
949f84d329 PdfParser requires bytes 2019-10-21 20:12:33 +01:00
Chris Mayo
a31289c97d Add pdfminer to tox.ini and dev-requirements.txt to enable pdf test 2019-10-21 20:06:44 +01:00
Chris Mayo
7da64b16f0 Don't add linkcheck_dns directory to sys.path
This code was added in:
efbbb656 ("Remove python-dns conflict by moving the dns module into a custom subdirectory.", 2012-12-07)

Installation of linkcheck_dns stopped with:
0a13fae3 ("remove third party packages and use them as dependency", 2018-01-06)
2019-10-21 19:52:58 +01:00
Marius Gedminas
bbb90eba81
Merge pull request #321 from linkchecker/wait-for-threads-to-exit
Wait for threads to exit after stopping them
2019-10-21 20:50:04 +03:00
Marius Gedminas
e274d74be2 Wait for threads to exit after stopping them
This fixes a race condition where the main thread would check if any
internal errors happened and get back a 0 while a worker thread was
still busy printing the internal error message before incrementing the
counter.

Fixes #320.

My experiments show that this adds no perceptible delay to the script
runtime (on Linux).  More specifically, there already is an annoying
perceptible delay of about 1 second, but it's not caused by this change.
2019-10-21 18:23:58 +03:00
Marius Gedminas
ade5a5c399
Merge pull request #319 from linkchecker/nonascii-regression
Fix TypeError: string arg required in find_links()
2019-10-21 18:02:16 +03:00
Marius Gedminas
84dbb5d603 Fix TypeError: string arg required in find_links()
Fixes #317.
2019-10-21 17:47:46 +03:00
Marius Gedminas
a4967fe92c Add a regression test for issue #317
The important bit was making the `file_test` helper not ignore internal
errors.
2019-10-21 17:45:18 +03:00
Marius Gedminas
42c75b5ef9 Move some pytest options into pytest.ini
This is so that I can run `tox -- -n 8` to run the tests in parallel, or
`tox -- tests/checker/test_misc.py::TestMisc::test_html5` to run just a
single test, without having to repeat all the other options.

I haven't moved --cov=linkcheck because I don't want coverage results
when I'm limiting the test run to a single test (they just make the
interesting bit -- the test result itself -- scroll up).

I've also added -ra to the default option list because then several
tests fail, I'd like to see a list of their names in one place, not
spead out between the huge tracebacks.
2019-10-21 17:42:29 +03:00
Jon Oster
2e2c81130e Add instructions to install current release tag from git via pip
Signed-off-by: Jon Oster <jon.oster@here.com>
2019-10-21 16:10:26 +02:00
anarcat
895dc016b9
Merge pull request #315 from cjmayo/lessnetwork
Remove unused code from network subpackage
2019-10-20 17:06:06 -04:00
Chris Mayo
c7a32d67fe Remove unused code from network subpackage 2019-10-19 10:27:34 +01:00
anarcat
f73ba54a2a
Merge pull request #308 from cjmayo/decode
Decode content when retrieved
2019-10-10 09:46:32 -04:00
anarcat
7cfb1136e9
Merge pull request #313 from cjmayo/titlefinder
Remove unused linkparse.TitleFinder
2019-10-07 11:30:10 -04:00
anarcat
5a43cfec40
Merge pull request #312 from cjmayo/notneeded
Revert Python 3 patches not needed after decode
2019-10-07 11:29:52 -04:00
Chris Mayo
127c2272c4 Remove unused linkparse.TitleFinder
Stopped being used with removal of UrlBase.set_title_from_content() in:

7b34be59 ("Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements.", 2014-03-01)
2019-10-05 19:43:33 +01:00
Chris Mayo
5732606c58 Remove urlutil.decode_for_unquote()
Not needed since all content is now being decoded on retrieval.

Added by:
a6643034 ("Python3: decode parts before submitting them to urllib.quote()", 2018-01-05)
2019-10-04 19:37:09 +01:00
Chris Mayo
2776eb5f52 Revert "Python3: fix opening file URLs"
This reverts commit 4c9ec511b5.
2019-10-04 19:37:09 +01:00
anarcat
07cf9c1c11
Merge pull request #310 from cjmayo/writeln
Remove unnecessary unicode() from StatusLogger.writeln()
2019-10-01 09:36:25 -04:00
Chris Mayo
c6a06d99ac Remove unnecessary unicode() from StatusLogger.writeln() 2019-09-30 20:06:48 +01:00
Petr Dlouhý
6e8da10942 fixes for Python 3: fix markdowncheck
The translate() method of string objects (and Python 2 Unicode objects)
only accepts a single, table argument.
2019-09-30 19:46:24 +01:00
Chris Mayo
e01ea0d9f0 Safari bookmark parser requires bytes 2019-09-30 19:46:24 +01:00
Chris Mayo
ad33d359c1 Adapt Opera bookmark parser to work with decoded data 2019-09-30 19:46:24 +01:00
Chris Mayo
9460064084 Use requests to decode the content of login form 2019-09-30 19:46:24 +01:00
Chris Mayo
5fc01455b7 Decode content when retrieved, use bs4 to detect encoding if non-Unicode
UrlBase has been modified as follows:
- the "data" variable now holds bytes
- decoded content is stored in a new variable "text"
- functionality from get_content() has been split out into
  get_raw_content() which returns "data" and download_content() which
  calls read_content() and sets the download related variables.
  This allows for subclasses to do their own decoding and parsers to
  use bytes.
2019-09-30 19:46:24 +01:00
Chris Mayo
0c90c718bf Revert "Python3: fix bytes mark in parser/__init__.py"
This reverts commit aec8243348.
2019-09-30 19:46:24 +01:00