85 commits

Author SHA1 Message Date
Chris Mayo
926932411d Only attempt to get rel attribute from link elements 2023-01-17 19:23:29 +00:00
Chris Mayo
2294160a6a Fix minimum version of Beautiful Soup increased to 4.11.0
Since:
6d9061b0 ("Ignore bs4 markup and XML parser warnings", 2022-09-02)
2022-11-30 19:21:06 +00:00
Chris Mayo
b6bc366af0 Run pyupgrade --py37-plus x 2 2022-11-08 19:21:29 +00:00
Stefan Fisk
d2b9723612 Fix srcset parsing
Resolves #631
2022-09-07 21:24:23 +02:00
Chris Mayo
6d9061b00a Ignore bs4 markup and XML parser warnings
XMLParsedAsHTMLWarning: It looks like you're parsing an XML document
using an HTML parser.

MarkupResemblesLocatorWarning: The input looks more like a filename than
markup.

MarkupResemblesLocatorWarning: The input looks more like a URL than
markup.
2022-09-02 19:29:11 +01:00
Koen Van den Wijngaert
900586dc01 Better handling for link rel dns-prefetch and add preconnect support (#536)
preconnect is only DNS checked.

This is allowed even in the Resource Hints Editor's Draft
https://w3c.github.io/resource-hints/#preconnect
2021-12-09 19:38:30 +00:00
Chris Mayo
27f22ae17a Fix treating data: URIs in srcset values as links 2020-08-07 20:04:23 +01:00
Chris Mayo
7ba4053710 Fix critical exception if srcset value ends with a comma
Log a debug message as this is a minor syntax problem, won't stop
LinkChecker parsing strings up to the comma.
2020-08-07 20:04:23 +01:00
Chris Mayo
2f51a9dca0 Improve documentation of authentication 2020-06-23 17:28:31 +01:00
Chris Mayo
a92a684ac4 Run black on linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
a127902607 Replace str_text in asserts 2020-05-19 19:56:42 +01:00
Chris Mayo
a15a2833ca Remove spaces after names in class method definitions
And also nested functions.

This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
1663e10fe7 Remove spaces after names in function definitions
This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
08ddf658bc Merge pull request #366 from cjmayo/userorpwd
Support login forms with user and/or password
2020-05-13 19:37:44 +01:00
Chris Mayo
736c893707 Merge pull request #377 from cjmayo/tidyten3
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
3ace021264 Support login forms with user and/or password 2020-05-13 19:32:25 +01:00
Chris Mayo
44e81d27dd Remove inheriting object
All Python 3 classes are new-style.
2020-05-08 10:45:31 +01:00
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-unicode characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa Remove u string prefixes 2020-04-30 20:11:59 +01:00
anarcat
ab476fa4bf Merge pull request #364 from cjmayo/parser5
Stop using HTML handlers and improve login form error handling
2020-04-30 09:28:48 -04:00
Chris Mayo
12a948894b Fix space style in linkcheck/htmlutil/linkparse.py 2020-04-29 20:07:00 +01:00
Chris Mayo
9eed070a73 Stop using HTML handlers
LinkFinder is the only remaining HTML handler therefore no need for
htmlsoup.process_soup() as an independent function or TagFinder as a
base class.
2020-04-29 20:07:00 +01:00
Chris Mayo
4ffdbf2406 Replace MetaRobotsFinder using BeautifulSoup.find() 2020-04-29 20:07:00 +01:00
Chris Mayo
a51f02cf66 Improve error handling and debugging for login form 2020-04-27 18:06:29 +01:00
Chris Mayo
8fc0dcc055 Make matching login form credentials case-sensitive
The keys of the form.data dictionary are case-sensitive and therefore a
KeyError was possible if the configured values are not identical to
the input element name attributes.
2020-04-27 18:06:29 +01:00
Chris Mayo
7a6ef938cc Rename htmlutil.formsearch to htmlutil.loginformsearch
Make it clear that this module has only one specific use.
2020-04-27 18:06:29 +01:00
Marius Gedminas
680783b1ff SWF files are binary data
Should fix #372.
2020-04-27 11:25:37 +03:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
eb3cf28baa Remove support for start_end_element() callback
The LinkFinder handler start_end_element() callback does nothing apart
from call start_element().
2020-04-10 13:51:09 +01:00
Chris Mayo
48b590cf8b Replace FormFinder using BeautifulSoup.find_all()
FormFinder was the only handler that used an end_element() callback and
was therefore a blocker to moving the Parser class to use
BeautifulSoup.find_all()

FormFinder was a specialised handler used to parse a login form at
the start of a session if the user had configured authentication
credentials.
2020-04-10 13:51:05 +01:00
Chris Mayo
02e1c389b2 Remove parser flush() and reset()
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
3771dd9136 Use parser.feed_soup() instead of parser.feed()
Markup is not being passed in pieces to the parser, so simplify the
interface and reduce the state further.
2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06 Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
16e6fb2919 Fix incorrect character in FormFinder log message 2020-04-07 19:24:34 +01:00
Chris Mayo
00f940d979 Fix FormFinder callbacks for missing element_text
element_text added in:
51a06d8a ("Remove home-cooked htmlparser and use BeautifulSoup",
2019-07-22)
2020-04-07 19:24:34 +01:00
Chris Mayo
3ff3d72492 Use BeautifulSoup element attrs directly 2020-04-03 19:24:08 +01:00
Chris Mayo
a7e1e20172 Remove last line and column from Parser
Only used for debug log message and not very useful.
2020-04-03 19:24:08 +01:00
Chris Mayo
28701e291a Remove use of Python 2 unicode() and related u prefixes
Several instances for MS Windows left unchanged.
2020-04-01 19:39:50 +01:00
Chris Mayo
2c000683e1 Remove unused linkcheck.htmlutil.linkname module
Unused since:
d6d48b48 ("html parser: use name instead of peeking", 2019-07-22)
2020-03-30 19:31:11 +01:00
Chris Mayo
607328d5c5 Support Beautiful Soup line numbers 2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf Don't set parser.encoding
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Petr Dlouhý
d6d48b4814 html parser: use name instead of peeking 2019-07-22 19:59:37 +01:00
Petr Dlouhý
51a06d8a1e Remove home-cooked htmlparser and use BeautifulSoup 2019-07-22 19:59:37 +01:00
anarcat
7cfb1136e9 Merge pull request #313 from cjmayo/titlefinder
Remove unused linkparse.TitleFinder
2019-10-07 11:30:10 -04:00
Chris Mayo
127c2272c4 Remove unused linkparse.TitleFinder
Stopped being used with removal of UrlBase.set_title_from_content() in:

7b34be59 ("Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements.", 2014-03-01)
2019-10-05 19:43:33 +01:00
Chris Mayo
5732606c58 Remove urlutil.decode_for_unquote()
Not needed since all content is now being decoded on retrieval.

Added by:
a6643034 ("Python3: decode parts before submitting them to urllib.quote()", 2018-01-05)
2019-10-04 19:37:09 +01:00
anarcat
8c072fa757 Merge pull request #289 from cjmayo/python3_38
{python3_38} Python3: fix linkname.py
2019-09-12 08:39:29 -04:00
Petr Dlouhý
538c4cfeb9 Python3: fix linkname.py 2019-09-11 20:32:33 +01:00
Petr Dlouhý
e10f25b968 fixes for Python 3: fix running problems in Python 3 2019-09-10 19:30:09 +01:00
Petr Dlouhý
2c6411d68e Python3: fix regexp format 2019-04-17 19:50:06 +01:00
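
Illustrative sketch for the srcset commits above (d2b9723612, 27f22ae17a, 7ba4053710). Those commits describe three behaviours: parsing srcset values, not treating data: URIs as links, and tolerating a value that ends with a comma while logging a debug message. The code below is a minimal sketch of how such parsing could look. It is not taken from LinkChecker, and the function name parse_srcset and the logger name are hypothetical. It assumes candidates are separated by a comma followed by whitespace or a descriptor, so a comma inside a data: URI does not split the URL.

```python
import logging

# Hypothetical logger name, used only for this sketch.
log = logging.getLogger("srcset_sketch")

WHITESPACE = " \t\n\r\f"


def parse_srcset(value):
    """Return the URLs found in a srcset value, leaving out data: URIs.

    Assumption: each candidate is "URL [descriptor]" and a URL ends at the
    first whitespace character, so commas inside a URL are kept.
    """
    if value.rstrip(WHITESPACE).endswith(","):
        # Minor syntax problem: report it, but still parse what is there.
        log.debug("srcset value ends with a comma: %r", value)

    urls = []
    pos = 0
    length = len(value)
    while pos < length:
        # Skip whitespace and commas between candidates.
        while pos < length and value[pos] in WHITESPACE + ",":
            pos += 1
        if pos >= length:
            break

        # Collect the URL: everything up to the next whitespace character.
        start = pos
        while pos < length and value[pos] not in WHITESPACE:
            pos += 1
        url = value[start:pos]

        if url.endswith(","):
            # Trailing commas end this candidate; it has no descriptor.
            url = url.rstrip(",")
        else:
            # Skip the optional descriptor (for example "2x" or "480w").
            while pos < length and value[pos] != ",":
                pos += 1

        # data: URIs embed their content and are not links to check.
        if not url.lower().startswith("data:"):
            urls.append(url)
    return urls
```

Usage, under the same assumptions: parse_srcset("small.png 1x, large.png 2x,") returns both image URLs and emits only a debug message for the trailing comma, while a candidate such as "data:image/png;base64,AAAA 1x" is left out of the result.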