Commit graph

513 commits

Author SHA1 Message Date
Marius Gedminas
f268a90cfb
Merge branch 'master' into HandleRateLimiting 2020-05-23 14:15:52 +03:00
Marius Gedminas
5bd1fb4e36 Fix internal error on empty HTML files
When BeautifulSoup finds an empty file on disk, it sets
original_encoding to None.  It doesn't matter what encoding we pick for
empty files, so let's just pick one.

I don't know if there are any circumstances where BeautifulSoup might
set the encoding to None for a non-empty file.

Closes #392.
2020-05-21 19:01:33 +03:00
Chris Mayo
96e1c00ff7 TestLogger diff output is all Unicode in Python 3 2020-05-20 19:58:44 +01:00
Chris Mayo
71eaf9a982 Remove str_text from tests/ 2020-05-19 19:56:42 +01:00
Chris Mayo
a127902607 Replace str_text in asserts 2020-05-19 19:56:42 +01:00
Chris Mayo
12fd59057e Remove duplicate tests from test_strformat.py 2020-05-17 20:10:28 +01:00
Chris Mayo
339d293326 Convert tests/test_po.py to UTF-8 2020-05-17 20:10:28 +01:00
Chris Mayo
04465530c4 Use HttpServerTest.get_url() 2020-05-17 20:10:28 +01:00
Chris Mayo
58dbe1f282 Remove unused import pytest from tests/checker/test_http.py
pytest.mark.xfail() removed in:
743a5f31 ("Crawl HTML attributes in deterministic order", 2017-02-01)
2020-05-17 20:10:28 +01:00
Chris Mayo
79eafee826 Add a test for VirusCheck 2020-05-17 19:04:49 +01:00
Chris Mayo
a15a2833ca Remove spaces after names in class method definitions
And also nested functions.

This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
1663e10fe7 Remove spaces after names in function definitions
This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
fc11d08968 Remove spaces after names in class definitions 2020-05-16 20:19:42 +01:00
Chris Mayo
1416a08119 On Python 3 no need to convert os.linesep to a string 2020-05-16 17:02:01 +01:00
Chris Mayo
10552a79c7 Remove LinkCheckTest.fail_unicode()
No need to encode Python 3 strings before output.
2020-05-16 17:02:00 +01:00
Chris Mayo
9f95d06a39 Remove Python 2 test.test_support import 2020-05-16 16:26:38 +01:00
Chris Mayo
f8c9faec1b Remove Python 2 cStringIO imports 2020-05-15 19:37:04 +01:00
Chris Mayo
bda9612273 Make html.escape Python 3 only 2020-05-14 20:15:28 +01:00
Chris Mayo
42de609f8e Make urllib imports Python 3 only 2020-05-14 20:15:28 +01:00
Chris Mayo
08ddf658bc
Merge pull request #366 from cjmayo/userorpwd
Support login forms with user and/or password
2020-05-13 19:37:44 +01:00
Chris Mayo
736c893707
Merge pull request #377 from cjmayo/tidyten3
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
00c4a30386 Add user and password only loginurl tests 2020-05-13 19:32:29 +01:00
Chris Mayo
31a9f68c46
Merge pull request #367 from cjmayo/loginurl
Add test for loginurl
2020-05-12 20:08:57 +01:00
Chris Mayo
44e81d27dd Remove inheriting object
All Python 3 classes are new-style.
2020-05-08 10:45:31 +01:00
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-unicode characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa Remove u string prefixes 2020-04-30 20:11:59 +01:00
anarcat
ab476fa4bf
Merge pull request #364 from cjmayo/parser5
Stop using HTML handlers and improve login form error handling
2020-04-30 09:28:48 -04:00
Chris Mayo
1d1d9c3bde Add testing for variants of the robots meta directive 2020-04-29 20:14:10 +01:00
Chris Mayo
9eed070a73 Stop using HTML handlers
LinkFinder is the only remaining HTML handler therefore no need for
htmlsoup.process_soup() as an independent function or TagFinder as a
base class.
2020-04-29 20:07:00 +01:00
Chris Mayo
a1433767e5 Replace HtmlPrettyPrinter with pretty_print_html() 2020-04-29 20:07:00 +01:00
Chris Mayo
0361d9e0e8 Remove encoding and default fd from HtmlPrettyPrinter
Neither are used.
2020-04-29 20:07:00 +01:00
Chris Mayo
4ffdbf2406 Replace MetaRobotsFinder using BeautifulSoup.find() 2020-04-29 20:07:00 +01:00
Chris Mayo
8fc0dcc055 Make matching login form credentials case-sensitive
The keys of the form.data dictionary are case-sensitive and therefore a
KeyError was possible if the configured values are not identical to
the input element name attributes.
2020-04-27 18:06:29 +01:00
Chris Mayo
7a6ef938cc Rename htmlutil.formsearch to htmlutil.loginformsearch
Make it clear that this module has only one specific use.
2020-04-27 18:06:29 +01:00
anarcat
183d483074
Merge pull request #365 from cjmayo/tidyten1
Remove use of the future package
2020-04-26 12:02:30 -04:00
Chris Mayo
3b8af403be Add test for loginurl
A new cgi-bin directory is created to identify the scripts to be run by
http.server.CGIHTTPRequestHandler.
2020-04-19 19:05:55 +01:00
Chris Mayo
56b8c9f7ab Add tests for <meta name="robots" content="nofollow">
norobots.html was used for testing <meta name="robots"
content="nofollow"> in local files until [1]. This commit reinstates
local file testing and adds an http test.

Checking is reported by checker.httpurl.HttpUrl.content_allows_robots().

[1] ce733ae7 ("Don't check for robots.txt directives in local html
files.", 2014-03-19)
2020-04-18 20:30:46 +01:00
Chris Mayo
d189445a8e LinkFinder does not raise StopParse 2020-04-18 20:30:46 +01:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
a83fbb56c0 Remove from __future__ imports 2020-04-15 19:49:16 +01:00
Chris Mayo
f5e7f3a382 Remove use of the future package
It was providing Python 2 compatibility.
2020-04-15 19:49:16 +01:00
Chris Mayo
0795e3c1b4 Replace Parser class using BeautifulSoup.find_all() 2020-04-10 13:51:09 +01:00
Chris Mayo
eb3cf28baa Remove support for start_end_element() callback
The LinkFinder handler start_end_element() callback does nothing apart
from call start_element().
2020-04-10 13:51:09 +01:00
Chris Mayo
c9f17e92b9 Remove support for end_element() callback 2020-04-10 13:51:09 +01:00
Chris Mayo
48b590cf8b Replace FormFinder using BeautifulSoup.find_all()
FormFinder was the only handler that used an end_element() callback and
was therefore a blocker to moving the Parser class to use
BeautifulSoup.find_all()

FormFinder was a specialised handler used to parse a login form at
the start of a session if the user had configured authentication
credentials.
2020-04-10 13:51:05 +01:00
Chris Mayo
974915cc4f Remove encoding from Parser
Only used by the test and an attribute of the soup object.
2020-04-08 20:03:35 +01:00
Chris Mayo
02e1c389b2 Remove parser flush() and reset()
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
3771dd9136 Use parser.feed_soup() instead of parser.feed()
Markup is not being passed in pieces to the parser, so simplify the
interface and reduce the state further.
2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06 Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
514210199d Add tests for search_form 2020-04-07 19:24:34 +01:00