299 commits

Author SHA1 Message Date
Chris Mayo
1663e10fe7 Remove spaces after names in function definitions
This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
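
As a small illustration of the commit above: pycodestyle's E211 check flags whitespace before the opening parenthesis, which is a style matter only. The sketch below uses the standard `ast` module to show that both spellings parse to the same tree; the function name `f` and the variable names are made up for the example and are not from the project.

```python
import ast

# Two spellings of the same function definition. The first has a space
# between the name and the parenthesis, which E211 reports.
spaced_src = "def f (x):\n    return x\n"
tight_src = "def f(x):\n    return x\n"

# ast.dump() omits line and column attributes by default, so the two
# dumps are equal when the code differs only in whitespace.
same_ast = ast.dump(ast.parse(spaced_src)) == ast.dump(ast.parse(tight_src))
print(same_ast)
```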
Chris Mayo
42de609f8e Make urllib imports Python 3 only 2020-05-14 20:15:28 +01:00
Chris Mayo
736c893707
Merge pull request #377 from cjmayo/tidyten3
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
44e81d27dd Remove inheriting object
All Python 3 classes are new-style.
2020-05-08 10:45:31 +01:00
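
The reasoning in the commit above can be checked directly: in Python 3 a class declared without bases still derives from `object`. A minimal sketch; the class names `Explicit` and `Implicit` are hypothetical and do not appear in the project.

```python
class Explicit(object):
    """Python 2 style declaration with an explicit object base."""


class Implicit:
    """Python 3 style declaration; object is the implied base."""


# Both classes end up with the same single base class.
print(Explicit.__bases__ == Implicit.__bases__)
print(Implicit.__mro__ == (Implicit, object))
```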
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-unicode characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa Remove u string prefixes 2020-04-30 20:11:59 +01:00
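
For context on why the prefixes can be dropped: in Python 3 the `u` prefix is accepted but has no effect, since every string literal is already a `str`. A short sketch with made-up variable names:

```python
# In Python 3 both literals produce the same str object type and value.
prefixed = u"linkchecker"
plain = "linkchecker"

print(type(prefixed) is str)
print(prefixed == plain)
```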
anarcat
183d483074
Merge pull request #365 from cjmayo/tidyten1
Remove use of the future package
2020-04-26 12:02:30 -04:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
f5e7f3a382 Remove use of the future package
It was providing Python 2 compatibility.
2020-04-15 19:49:16 +01:00
Chris Mayo
40f43ae41c Create one function to make soup objects 2020-04-08 20:03:35 +01:00
Chris Mayo
3ff3d72492 Use BeautifulSoup element attrs directly 2020-04-03 19:24:08 +01:00
Chris Mayo
5b66964afa Remove unused .charset from checker classes
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Chris Mayo
646e138166 Pass encoding when unquoting
Else non-UTF-8 codes are misinterpreted:

>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
2019-10-05 19:38:57 +01:00
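
The session in the commit above can be wrapped in a small script. `urllib.parse.unquote` takes `encoding` and `errors` arguments that default to `"utf-8"` and `"replace"`, which is why an undecodable byte becomes U+FFFD. This is only a sketch: the helper name `unquote_with` is hypothetical and is not the project's function.

```python
from urllib.parse import unquote


def unquote_with(url_part, encoding="utf-8"):
    """Percent-decode url_part, interpreting the bytes in the given encoding."""
    return unquote(url_part, encoding=encoding)


# 0xFF is not valid UTF-8, so the default decoding substitutes U+FFFD.
default_result = unquote_with("%FF")

# In Latin-1 the byte 0xFF maps to U+00FF (y with diaeresis).
latin1_result = unquote_with("%FF", "latin1")

print(repr(default_result), repr(latin1_result))
```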
Chris Mayo
153e53ba03 Reuse soup object used for detecting encoding in the HTML parser 2019-10-05 19:38:57 +01:00
Chris Mayo
607328d5c5 Support Beautiful Soup line numbers 2019-10-05 19:38:57 +01:00
Chris Mayo
5fc01455b7 Decode content when retrieved, use bs4 to detect encoding if non-Unicode
UrlBase has been modified as follows:
- the "data" variable now holds bytes
- decoded content is stored in a new variable "text"
- functionality from get_content() has been split out into
  get_raw_content() which returns "data" and download_content() which
  calls read_content() and sets the download related variables.
  This allows for subclasses to do their own decoding and parsers to
  use bytes.
2019-09-30 19:46:24 +01:00
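
The split described in the commit above might look roughly like the following. This is a sketch under stated assumptions, not the project's code: the class `UrlSketch`, the `payload` argument and the `size` attribute are hypothetical, and the Latin-1 fallback is a placeholder for the bs4-based encoding detection that the commit mentions. Only the names `data`, `text`, `get_content()`, `get_raw_content()`, `download_content()` and `read_content()` come from the commit message.

```python
class UrlSketch:
    """Hypothetical stand-in for UrlBase, for illustration only."""

    def __init__(self, payload):
        self._payload = payload  # assumed: bytes a real fetch would return
        self.data = None         # raw bytes
        self.text = None         # decoded content
        self.size = 0            # assumed example of a download-related variable

    def read_content(self):
        """Return the raw bytes (a real subclass would read from the network)."""
        return self._payload

    def download_content(self):
        """Call read_content() and set the download-related variables."""
        content = self.read_content()
        self.size = len(content)
        return content

    def get_raw_content(self):
        """Return "data", downloading it on first use."""
        if self.data is None:
            self.data = self.download_content()
        return self.data

    def get_content(self):
        """Return "text", decoding "data" on first use."""
        if self.text is None:
            raw = self.get_raw_content()
            try:
                self.text = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Assumption: placeholder for bs4 encoding detection.
                self.text = raw.decode("latin-1")
        return self.text


url = UrlSketch(b"caf\xe9")
print(url.get_content(), url.get_raw_content(), url.size)
```

Keeping the bytes in `data` and the decoded string in `text` lets a parser that needs raw bytes call `get_raw_content()` without triggering a decode.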
Petr Dlouhý
c2af88ad2e Python3: fix for test_telnet in urlbase.py 2019-09-15 19:49:26 +01:00
Petr Dlouhý
e10f25b968 fixes for Python 3: fix running problems in Python 3 2019-09-10 19:30:09 +01:00
Petr Dlouhý
e92b0a9f7b Python3: fix unicode in urlbase 2019-04-25 19:57:45 +01:00
Petr Dlouhý
b3881ce3b5 Python3: fix urlbase, strformat and others 2019-04-25 19:57:45 +01:00
Petr Dlouhý
4acabf5cb5 fix urllib imports 2019-04-09 20:09:35 +01:00
Graham Seaman
233e7dcf68 Allow wayback-format urls without affecting atom 'feed' urls 2017-02-09 11:43:45 +00:00
Bastian Kleineidam
35eb30432e Added some Python3 fixes. 2014-09-12 19:36:30 +02:00
Bastian Kleineidam
0fa7ed2699 Fix empty URL handling. 2014-07-03 23:34:40 +02:00
Bastian Kleineidam
82dd76b0d7 Add PDF link parsing. 2014-04-28 18:13:45 +02:00
Bastian Kleineidam
22caa9367a Refactor recursion checks. 2014-04-10 17:50:55 +02:00
Bastian Kleineidam
ce733ae76b Don't check for robots.txt directives in local html files. 2014-03-19 16:33:22 +01:00
Bastian Kleineidam
6437f08277 Display downloaded bytes. 2014-03-14 21:06:10 +01:00
Bastian Kleineidam
c51caf1133 Assertions should be earlier. 2014-03-14 20:26:11 +01:00
Bastian Kleineidam
cfff4c4a84 Disable URL length warning for data: URLs. 2014-03-14 20:24:28 +01:00
Bastian Kleineidam
bca226c293 Fix assertion checking external links; fix tests 2014-03-10 18:23:44 +01:00
Bastian Kleineidam
6b334dc79b Fix URL result caching. 2014-03-08 19:35:10 +01:00
Bastian Kleineidam
fab2c2da98 Improve content type setting. 2014-03-05 20:12:19 +01:00
Bastian Kleineidam
ef13a3fce1 Implement sitemap and sitemap index parsing. 2014-03-05 19:26:37 +01:00
Bastian Kleineidam
b72cf252fb Move parseable check down since it might get the content. 2014-03-05 19:26:05 +01:00
Bastian Kleineidam
9ef65cb774 Fix UrlData string representation. 2014-03-05 19:25:40 +01:00
Bastian Kleineidam
192cfab009 Cleanup of the UrlData.is_* functions 2014-03-05 19:23:16 +01:00
Bastian Kleineidam
978b24f2d7 Merge branch 'caching' 2014-03-04 07:21:42 +01:00
Bastian Kleineidam
f1076c8813 Increase url-too-long warning. 2014-03-03 23:31:04 +01:00
Bastian Kleineidam
82f81241fd Check all links and add better caching. 2014-03-03 23:29:45 +01:00
Bastian Kleineidam
7b34be590b Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements. 2014-03-01 00:12:34 +01:00
Bastian Kleineidam
c806be5c15 Updated copyright 2014-01-08 22:33:04 +01:00
Bastian Kleineidam
0ca63797bf Remove content cache. 2013-12-10 23:41:52 +01:00
Bastian Kleineidam
023da7c993 Remove the duplicate URL content check. 2013-12-04 19:12:40 +01:00
Bastian Kleineidam
64d95e45e0 Remove local HTML and CSS syntax check. 2013-02-08 21:36:02 +01:00
Bastian Kleineidam
e6ad32c028 Catch UnicodeError for invalid host names. 2013-01-23 19:42:29 +01:00
Bastian Kleineidam
7fe72745ae Updated copyright. 2013-01-09 23:03:12 +01:00
Bastian Kleineidam
a5b6136e70 Check word document validity before closing. 2013-01-07 21:58:02 +01:00
Bastian Kleineidam
9820530313 Use better_exchook to print more internal error info. 2012-12-18 23:06:48 +01:00
Bastian Kleineidam
42a17cbb98 Prepare py3 port and display sys.argv on internal errors. 2012-11-26 18:49:07 +01:00