107 commits

Author SHA1 Message Date
Chris Mayo
a92a684ac4 Run black on linkcheck/ 2020-05-30 17:01:36 +01:00
Chris Mayo
488e72c81f Ignore imports providing aliases in subpackages 2020-05-26 19:49:59 +01:00
Chris Mayo
7257e5e1a0 Remove unused imports in parser/__init__.py 2020-05-25 19:50:57 +01:00
Chris Mayo
1663e10fe7 Remove spaces after names in function definitions
This is a PEP 8 convention, E211.
2020-05-16 20:19:42 +01:00
Chris Mayo
ed13a926d3 Remove setting Python 2 xmlparser.returns_unicode 2020-05-16 17:02:00 +01:00
Chris Mayo
736c893707 Merge pull request #377 from cjmayo/tidyten3
Remove u string prefixes
2020-05-13 19:36:54 +01:00
Chris Mayo
44e81d27dd Remove inheriting object
All Python 3 classes are new-style.
2020-05-08 10:45:31 +01:00
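As a small illustration of the "Remove inheriting object" message above, the sketch below checks that a Python 3 class declared without a base still derives from object. The class names are hypothetical and are not taken from linkcheck.

```python
# Hypothetical classes, used only to illustrate the commit message above.
class ExplicitBase(object):
    """Declared the Python 2 way, with an explicit object base."""


class ImplicitBase:
    """Declared without a base; on Python 3 it is still a new-style class."""


def same_base_classes():
    # Both classes should report object as their only direct base.
    return ExplicitBase.__bases__ == ImplicitBase.__bases__ == (object,)
```

On Python 3 the two declarations behave the same, which is why dropping the explicit base is a purely cosmetic change.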
Chris Mayo
b0ea72e8c1 Remove # -*- coding: lines
Except for tests that include non-unicode characters:

tests/test_po.py
tests/test_strformat.py
tests/test_url.py
tests/checker/test_error.py
tests/checker/test_news.py
2020-05-08 10:45:31 +01:00
Chris Mayo
4d3e5abcfa Remove u string prefixes 2020-04-30 20:11:59 +01:00
Chris Mayo
9eed070a73 Stop using HTML handlers
LinkFinder is the only remaining HTML handler therefore no need for
htmlsoup.process_soup() as an independent function or TagFinder as a
base class.
2020-04-29 20:07:00 +01:00
Marius Gedminas
680783b1ff SWF files are binary data
Should fix #372.
2020-04-27 11:25:37 +03:00
Chris Mayo
d189445a8e LinkFinder does not raise StopParse 2020-04-18 20:30:46 +01:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
Chris Mayo
0795e3c1b4 Replace Parser class using BeautifulSoup.find_all() 2020-04-10 13:51:09 +01:00
Chris Mayo
02e1c389b2 Remove parser flush() and reset()
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06 Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
f5ae90e824 Parser threading lock no longer required with Beautiful Soup 2020-03-22 19:54:37 +00:00
Chris Mayo
646e138166 Pass encoding when unquoting
Else non-UTF-8 codes are misinterpreted:

>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
2019-10-05 19:38:57 +01:00
Chris Mayo
153e53ba03 Reuse soup object used for detecting encoding in the HTML parser 2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf Don't set parser.encoding
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Chris Mayo
ec8b6e09f0 Fix XmlTagUrlParser and make Python 3 compatible
URLs within a sitemap file were not being captured.
2019-10-28 19:20:05 +00:00
Marius Gedminas
a1af1e9717 Fix sitemap parser
PyExpat wants bytes on Python 2.  See #323.
2019-10-23 17:23:23 +03:00
Marius Gedminas
84dbb5d603 Fix TypeError: string arg required in find_links()
Fixes #317.
2019-10-21 17:47:46 +03:00
Chris Mayo
e01ea0d9f0 Safari bookmark parser requires bytes 2019-09-30 19:46:24 +01:00
Chris Mayo
0c90c718bf Revert "Python3: fix bytes mark in parser/__init__.py"
This reverts commit aec8243348.
2019-09-30 19:46:24 +01:00
Petr Dlouhý
aec8243348 Python3: fix bytes mark in parser/__init__.py 2019-04-09 20:09:35 +01:00
Yaroslav Halchenko
7ed7919692 RF: place parser.flush() under mutex as well
Just a safety measure, not yet proven to be required but overall
makes sense
2018-11-06 10:58:10 -05:00
Yaroslav Halchenko
ee27e178ec BF: place a mutex around apparently thread-unsafe parser.feed invocation
That leads to fix up of anchors analysis and probably other issues
such as floating number of found urls etc
2018-11-01 11:10:01 -04:00
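The two mutex commits above describe serialising calls to a parser that is apparently not thread-safe. A minimal sketch of that idea follows, assuming a single shared lock around feed() and flush(). ToyParser, parse_safely and run_demo are hypothetical names for illustration and do not reproduce the linkchecker code.

```python
import threading


class ToyParser:
    """Hypothetical stand-in for a parser whose state is shared by threads."""

    def __init__(self):
        self.urls = []

    def feed(self, data):
        # Collect whitespace-separated tokens that look like URLs.
        for token in data.split():
            if token.startswith("http"):
                self.urls.append(token)

    def flush(self):
        # Nothing is buffered in this toy parser; kept to mirror the interface.
        pass


# One lock shared by every caller (an assumption of this sketch).
_parser_lock = threading.Lock()


def parse_safely(parser, data):
    # Only one thread at a time may run feed() and flush().
    with _parser_lock:
        parser.feed(data)
        parser.flush()


def run_demo(n_threads=8):
    parser = ToyParser()
    threads = [
        threading.Thread(
            target=parse_safely,
            args=(parser, "see http://example.com/%d for details" % i),
        )
        for i in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(parser.urls)
```

Calling run_demo() starts several threads that share one parser and returns the collected URLs once all of them have finished. Later commits in this history note that the lock is no longer required with Beautiful Soup.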
Bastian Kleineidam
ee4545399d Support itms-services: URLs. #532 2014-09-05 21:06:10 +02:00
Bastian Kleineidam
ad8eb424f3 Merge Mark-Hetherington-xml-parse-warn with slight modifications. 2014-06-13 20:50:37 +02:00
Bastian Kleineidam
82dd76b0d7 Add PDF link parsing. 2014-04-28 18:13:45 +02:00
Bastian Kleineidam
b6b5c7a12e Simpler link parsing routine. 2014-03-27 19:49:17 +01:00
Bastian Kleineidam
fab2c2da98 Improve content type setting. 2014-03-05 20:12:19 +01:00
Bastian Kleineidam
ef13a3fce1 Implement sitemap and sitemap index parsing. 2014-03-05 19:26:37 +01:00
Bastian Kleineidam
00bd549c0c Remove duplicate content type map. 2014-03-05 19:24:58 +01:00
Bastian Kleineidam
f9bf831804 Remove some empty lines 2014-03-01 12:02:00 +01:00
Bastian Kleineidam
9d0255e156 Fix bookmark imports 2014-03-01 10:16:29 +01:00
Bastian Kleineidam
7b34be590b Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements. 2014-03-01 00:12:34 +01:00
calvin
5a644b35b3 removed
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1354 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-07-06 22:08:05 +00:00
calvin
3bbfac47c7 removed
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1353 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-07-06 20:34:00 +00:00
calvin
bde88f9715 added string utils to parser, and sync with webcleaner
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1350 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-07-02 18:25:00 +00:00
calvin
04e0a9448d updated
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1325 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-05-27 00:30:37 +00:00
calvin
fa46757bd7 fix import
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1298 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-04-04 08:34:21 +00:00
calvin
68451e65dd O3
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1297 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-04-04 08:31:57 +00:00
calvin
93253954a8 updated
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1296 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-04-04 08:30:48 +00:00
calvin
672e118d9b use sorted dict
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1295 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-04-04 08:30:38 +00:00
calvin
8e4e92dddd minor improvements
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1294 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-04-04 08:30:21 +00:00
calvin
1b148b0b4e sorted dict
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1293 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-04-04 08:30:01 +00:00
calvin
e183ac84dc handle missing startquotes
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1292 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-04-04 08:29:31 +00:00
calvin
4df200a2d2 merged from webcleaner
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1205 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2004-01-28 23:38:00 +00:00