Marius Gedminas
|
680783b1ff
|
SWF files are binary data
Should fix #372.
|
2020-04-27 11:25:37 +03:00 |
|
Chris Mayo
|
d189445a8e
|
LinkFinder does not raise StopParse
|
2020-04-18 20:30:46 +01:00 |
|
Chris Mayo
|
ee6628a831
|
Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
|
2020-04-18 20:30:45 +01:00 |
|
Chris Mayo
|
0795e3c1b4
|
Replace Parser class using BeautifulSoup.find_all()
|
2020-04-10 13:51:09 +01:00 |
|
Chris Mayo
|
02e1c389b2
|
Remove parser flush() and reset()
Remnants of the feed() interface.
|
2020-04-08 20:03:35 +01:00 |
|
Chris Mayo
|
9d8d251d06
|
Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
|
2020-04-08 20:03:35 +01:00 |
|
Chris Mayo
|
f5ae90e824
|
Parser threading lock no longer required with Beautiful Soup
|
2020-03-22 19:54:37 +00:00 |
|
Chris Mayo
|
646e138166
|
Pass encoding when unquoting
Else non-UTF-8 codes are misinterpreted:
>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
|
2019-10-05 19:38:57 +01:00 |
|
Chris Mayo
|
153e53ba03
|
Reuse soup object used for detecting encoding in the HTML parser
|
2019-10-05 19:38:57 +01:00 |
|
Chris Mayo
|
4f8c2954cf
|
Don't set parser.encoding
Read-only property with new Beautiful Soup parser.
|
2019-10-05 19:38:57 +01:00 |
|
Chris Mayo
|
ec8b6e09f0
|
Fix XmlTagUrlParser and make Python 3 compatible
URLs within a sitemap file were not being captured.
|
2019-10-28 19:20:05 +00:00 |
|
Marius Gedminas
|
a1af1e9717
|
Fix sitemap parser
PyExpat wants bytes on Python 2. See #323.
|
2019-10-23 17:23:23 +03:00 |
|
Marius Gedminas
|
84dbb5d603
|
Fix TypeError: string arg required in find_links()
Fixes #317.
|
2019-10-21 17:47:46 +03:00 |
|
Chris Mayo
|
e01ea0d9f0
|
Safari bookmark parser requires bytes
|
2019-09-30 19:46:24 +01:00 |
|
Chris Mayo
|
0c90c718bf
|
Revert "Python3: fix bytes mark in parser/__init__.py"
This reverts commit aec8243348.
|
2019-09-30 19:46:24 +01:00 |
|
Petr Dlouhý
|
aec8243348
|
Python3: fix bytes mark in parser/__init__.py
|
2019-04-09 20:09:35 +01:00 |
|
Yaroslav Halchenko
|
7ed7919692
|
RF: place parser.flush() under mutex as well
Just a safety measure; not yet proven to be required, but overall it
makes sense.
|
2018-11-06 10:58:10 -05:00 |
|
Yaroslav Halchenko
|
ee27e178ec
|
BF: place a mutex around apparently thread-unsafe parser.feed invocation
This fixes the anchor analysis and probably other issues, such as
a fluctuating number of found URLs.
|
2018-11-01 11:10:01 -04:00 |
|
Bastian Kleineidam
|
ee4545399d
|
Support itms-services: URLs. #532
|
2014-09-05 21:06:10 +02:00 |
|
Bastian Kleineidam
|
ad8eb424f3
|
Merge Mark-Hetherington-xml-parse-warn with slight modifications.
|
2014-06-13 20:50:37 +02:00 |
|
Bastian Kleineidam
|
82dd76b0d7
|
Add PDF link parsing.
|
2014-04-28 18:13:45 +02:00 |
|
Bastian Kleineidam
|
b6b5c7a12e
|
Simpler link parsing routine.
|
2014-03-27 19:49:17 +01:00 |
|
Bastian Kleineidam
|
fab2c2da98
|
Improve content type setting.
|
2014-03-05 20:12:19 +01:00 |
|
Bastian Kleineidam
|
ef13a3fce1
|
Implement sitemap and sitemap index parsing.
|
2014-03-05 19:26:37 +01:00 |
|
Bastian Kleineidam
|
00bd549c0c
|
Remove duplicate content type map.
|
2014-03-05 19:24:58 +01:00 |
|
Bastian Kleineidam
|
f9bf831804
|
Remove some empty lines
|
2014-03-01 12:02:00 +01:00 |
|
Bastian Kleineidam
|
9d0255e156
|
Fix bookmark imports
|
2014-03-01 10:16:29 +01:00 |
|
Bastian Kleineidam
|
7b34be590b
|
Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements.
|
2014-03-01 00:12:34 +01:00 |
|
calvin
|
5a644b35b3
|
removed
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1354 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-07-06 22:08:05 +00:00 |
|
calvin
|
3bbfac47c7
|
removed
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1353 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-07-06 20:34:00 +00:00 |
|
calvin
|
bde88f9715
|
added string utils to parser, and sync with webcleaner
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1350 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-07-02 18:25:00 +00:00 |
|
calvin
|
04e0a9448d
|
updated
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1325 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-05-27 00:30:37 +00:00 |
|
calvin
|
fa46757bd7
|
fix import
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1298 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-04-04 08:34:21 +00:00 |
|
calvin
|
68451e65dd
|
O3
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1297 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-04-04 08:31:57 +00:00 |
|
calvin
|
93253954a8
|
updated
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1296 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-04-04 08:30:48 +00:00 |
|
calvin
|
672e118d9b
|
use sorted dict
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1295 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-04-04 08:30:38 +00:00 |
|
calvin
|
8e4e92dddd
|
minor improvements
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1294 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-04-04 08:30:21 +00:00 |
|
calvin
|
1b148b0b4e
|
sorted dict
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1293 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-04-04 08:30:01 +00:00 |
|
calvin
|
e183ac84dc
|
handle missing startquotes
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1292 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-04-04 08:29:31 +00:00 |
|
calvin
|
4df200a2d2
|
merged from webcleaner
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1205 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-01-28 23:38:00 +00:00 |
|
calvin
|
f4dde29117
|
parse fixes merged from webcleaner
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1204 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-01-28 23:04:39 +00:00 |
|
calvin
|
66ecc466b7
|
resolve entities
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1202 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-01-28 22:48:50 +00:00 |
|
calvin
|
26072afd92
|
new style parser object class
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1200 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-01-28 22:33:34 +00:00 |
|
calvin
|
2398ee2aa3
|
copyright updated
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1153 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-01-03 15:12:04 +00:00 |
|
calvin
|
fef96392d6
|
updated copyright
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1150 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2004-01-03 14:59:33 +00:00 |
|
calvin
|
02f42652fe
|
cosmetic fix
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1125 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2003-12-29 19:11:48 +00:00 |
|
calvin
|
8da038d3ef
|
rebuild parser, ignore invalid leading attr backslash
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1031 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2003-09-11 21:54:36 +00:00 |
|
calvin
|
3fef6f6023
|
fix js script comment line parsing
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1026 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2003-08-29 11:58:59 +00:00 |
|
calvin
|
7ff1edcd90
|
fix parsing of trailing end tag garbage
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1023 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2003-08-19 11:48:49 +00:00 |
|
calvin
|
c03e824438
|
use new-style classes
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@1008 e7d03fd6-7b0d-0410-9947-9c21f3af8025
|
2003-08-11 13:19:39 +00:00 |
|