Commit graph

856 commits

Author SHA1 Message Date
Chris Mayo
40f43ae41c Create one function to make soup objects 2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06 Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
Chris Mayo
3ff3d72492 Use BeautifulSoup element attrs directly 2020-04-03 19:24:08 +01:00
Chris Mayo
28701e291a Remove use of Python 2 unicode() and related u prefixes
Several instances for MS Windows left unchanged.
2020-04-01 19:39:50 +01:00
Chris Mayo
5b66964afa Remove unused .charset from checker classes
Unused since:
4f8c2954 ("Don't set parser.encoding", 2019-10-05)
2020-03-30 19:32:30 +01:00
Chris Mayo
5eaad24641 Use HTTP header encoding for decoding 2020-03-22 19:54:37 +00:00
Chris Mayo
a9f147c347 Update fileutil.pathencode() because paths are now strings 2019-10-05 19:38:57 +01:00
Chris Mayo
646e138166 Pass encoding when unquoting
Else non-UTF-8 codes are misinterpreted:

>>> from urllib import parse
>>> parse.unquote("%FF")
'�'
>>> parse.unquote("%FF", "latin1")
'ÿ'
2019-10-05 19:38:57 +01:00
Chris Mayo
153e53ba03 Reuse soup object used for detecting encoding in the HTML parser 2019-10-05 19:38:57 +01:00
Chris Mayo
607328d5c5 Support Beautiful Soup line numbers 2019-10-05 19:38:57 +01:00
Chris Mayo
4f8c2954cf Don't set parser.encoding
Read-only property with new Beautiful Soup parser.
2019-10-05 19:38:57 +01:00
Marius Gedminas
58b0d5aaae Fix TypeError: string arg required in content_allows_robots()
See #323 and #317.
2019-10-22 14:13:45 +03:00
anarcat
f73ba54a2a
Merge pull request #308 from cjmayo/decode
Decode content when retrieved
2019-10-10 09:46:32 -04:00
Chris Mayo
5732606c58 Remove urlutil.decode_for_unquote()
Not needed since all content is now being decoded on retrieval.

Added by:
a6643034 ("Python3: decode parts before submitting them to urllib.quote()", 2018-01-05)
2019-10-04 19:37:09 +01:00
Chris Mayo
2776eb5f52 Revert "Python3: fix opening file URLs"
This reverts commit 4c9ec511b5.
2019-10-04 19:37:09 +01:00
Chris Mayo
5fc01455b7 Decode content when retrieved, use bs4 to detect encoding if non-Unicode
UrlBase has been modified as follows:
- the "data" variable now holds bytes
- decoded content is stored in a new variable "text"
- functionality from get_content() has been split out into
  get_raw_content() which returns "data" and download_content() which
  calls read_content() and sets the download related variables.
  This allows for subclasses to do their own decoding and parsers to
  use bytes.
2019-09-30 19:46:24 +01:00
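The split described in commit 5fc01455b7 can be pictured with a minimal sketch. It is not the project's UrlBase: the class name `UrlSketch`, the constructor arguments, and the `size` attribute are hypothetical, and the encoding is passed in rather than detected with bs4. Only the names `data`, `text`, `read_content()`, `download_content()`, `get_raw_content()` and `get_content()` come from the commit message.

```python
class UrlSketch:
    """Hypothetical stand-in for UrlBase; not the project's actual code."""

    def __init__(self, payload: bytes, encoding: str = "utf-8"):
        self._payload = payload    # assumption: stands in for the network response
        self.encoding = encoding   # assumption: given here, detected in the real code
        self.data = None           # raw bytes, filled on first download
        self.text = None           # decoded content, filled on first request
        self.size = 0              # assumption: one "download related" variable

    def read_content(self) -> bytes:
        # Would read from the network or disk; here it returns the payload.
        return self._payload

    def download_content(self) -> bytes:
        # Calls read_content() and records download-related state.
        content = self.read_content()
        self.size = len(content)
        return content

    def get_raw_content(self) -> bytes:
        # Returns bytes, so parsers and subclasses can do their own decoding.
        if self.data is None:
            self.data = self.download_content()
        return self.data

    def get_content(self) -> str:
        # Decodes once and caches the result in "text".
        if self.text is None:
            self.text = self.get_raw_content().decode(self.encoding, "replace")
        return self.text


url = UrlSketch("ÿ".encode("latin1"), encoding="latin1")
raw = url.get_raw_content()
decoded = url.get_content()
```

Keeping the bytes in `data` and the decoded string in `text` means a caller picks the form it needs without decoding twice.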
Chris Mayo
53cd9475b5 Replace deprecated cgi.escape
html provided for Python 2 by future
https://python-future.org/compatible_idioms.html#html-escaping-and-entities
2019-09-17 20:25:05 +01:00
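As a rough illustration of the replacement named in commit 53cd9475b5, the standard library's `html.escape()` covers the job `cgi.escape()` used to do. One difference to keep in mind: `html.escape()` escapes quote characters by default, while `cgi.escape()` did not unless asked to. The sample string is invented for this sketch and does not come from the project.

```python
import html

# Invented sample input, not taken from the project.
snippet = '<a href="page.html?a=1&b=2">Tom\'s link</a>'

# Default behaviour: &, <, > and both quote characters are escaped.
escaped_all = html.escape(snippet)

# quote=False leaves quote characters alone, matching the old cgi.escape() default.
escaped_no_quotes = html.escape(snippet, quote=False)
```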
anarcat
2c7573b3b8
Merge pull request #300 from cjmayo/python3_43
{python3_43} Python3: fix for test_telnet in urlbase.py
2019-09-16 10:08:18 -04:00
anarcat
bec68f237b
Merge pull request #299 from cjmayo/python3_42
{python3_42} fixes for Python 3: fix telneturl
2019-09-16 10:07:55 -04:00
anarcat
27d672c78b
Merge pull request #297 from cjmayo/python3_40
{python3_40} Python3: fixes form checker/__init__.py
2019-09-16 10:06:05 -04:00
Petr Dlouhý
c2af88ad2e Python3: fix for test_telnet in urlbase.py 2019-09-15 19:49:26 +01:00
Petr Dlouhý
a2e67af7b4 fixes for Python 3: fix telneturl 2019-09-15 19:49:18 +01:00
Petr Dlouhý
bb542b00e9 Python3: fixes form checker/__init__.py 2019-09-15 19:49:00 +01:00
Chris Mayo
06fdd78f91 Python3: fix TypeError in HttpUrl.read_content()
From test_http_redirect:

  File "linkchecker/linkcheck/checker/httpurl.py", line 323, in read_content
    line: buf.write(data)
    locals:
      buf = <local> <_io.StringIO object at 0x7f8fe2f45e10>
      buf.write = <local> <built-in method write of _io.StringIO object at 0x7f8fe2f45e10>
      data = <local> b'<a href="newurl.html">Recursive Redirect</a>\n'
TypeError: string argument expected, got 'bytes'
2019-09-15 19:42:29 +01:00
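The traceback in commit 06fdd78f91 comes from writing bytes into a text buffer. The sketch below reproduces that error and shows one common way around it, a bytes buffer. The commit's actual change is not shown in this log, so this is an assumption about the general technique, not a copy of the fix; the chunk list is invented, modelled on the value in the traceback.

```python
import io

# Invented chunks, modelled on the bytes value shown in the traceback.
chunks = [b'<a href="newurl.html">', b"Recursive Redirect</a>\n"]

# A text buffer rejects bytes under Python 3.
text_buf = io.StringIO()
try:
    text_buf.write(chunks[0])
    failed = False
except TypeError:
    failed = True

# A bytes buffer accepts the same chunks unchanged.
buf = io.BytesIO()
for data in chunks:
    buf.write(data)
content = buf.getvalue()
```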
Chris Mayo
4c9ec511b5 Python3: fix opening file URLs
urllib.request.urlopen() expects a string or Request object.
2019-09-12 19:58:27 +01:00
anarcat
2239458966
Merge pull request #285 from cjmayo/python3_34
{python3_34} fixes for Python 3: fix test_misc
2019-09-11 09:48:14 -04:00
anarcat
492058a360
Merge pull request #281 from cjmayo/python3_30
{python3_30} Python3: fix decoding strings
2019-09-11 09:47:10 -04:00
Petr Dlouhý
f272206110 Python3: fix decoding strings 2019-09-10 19:52:23 +01:00
Petr Dlouhý
e10f25b968 fixes for Python 3: fix running problems in Python 3 2019-09-10 19:30:09 +01:00
Petr Dlouhý
129a68da38 fixes for Python 3: fix test_misc 2019-09-09 19:51:30 +01:00
Petr Dlouhý
ffb0a68ff7 Python3: fix fileurl 2019-09-05 19:41:53 +01:00
anarcat
59fe9ed876
Merge pull request #228 from cjmayo/python3_18
{python3_18} Python3: fix unicode in urlbase
2019-04-25 16:17:00 -04:00
anarcat
70f0bbf225
Merge pull request #250 from cjmayo/ftpserver
Get FtpServerTest working by updating to current pyftpdlib API
2019-04-25 16:16:33 -04:00
Petr Dlouhý
e92b0a9f7b Python3: fix unicode in urlbase 2019-04-25 19:57:45 +01:00
Petr Dlouhý
b3881ce3b5 Python3: fix urlbase, strformat and others 2019-04-25 19:57:45 +01:00
anarcat
bb0a1e1992
Merge pull request #242 from cjmayo/wummel
Update references to GitHub project from wummel to linkchecker
2019-04-24 10:58:15 -04:00
anarcat
ee8667e1ca
Merge pull request #229 from cjmayo/python3_19
{python3_19} Python3: fix unicode in fileurl
2019-04-24 10:57:45 -04:00
Chris Mayo
f60810b050 Fix Python 3 "TypeError: decoding str is not supported" in FtpUrl.cwd 2019-04-22 19:34:46 +01:00
Petr Dlouhý
b40f4722c7 Python3: fix unicode in fileurl 2019-04-19 20:42:38 +01:00
EsuS
004632a99b Update references to GitHub project from wummel to linkchecker
Remove all mention of donations.
2019-04-18 19:59:52 +01:00
Petr Dlouhý
bc99dc51de Python3: fix HtmlParser 2019-04-18 19:35:16 +01:00
Petr Dlouhý
4acabf5cb5 fix urllib imports 2019-04-09 20:09:35 +01:00
gerdneuman
de6a82b378
Added whatsapp:// to ignored protocols
Fixes https://github.com/wummel/linkchecker/issues/595
2018-08-09 13:49:15 +02:00
Petr Dlouhý
256202a20b fixes for Python 3: fix proxysupport 2018-01-19 09:52:43 +01:00
Antoine Beaupré
9b12b5d66f
workaround new limitation in requests
newer requests do not expose the internal SSL socket object so we
cannot verify certificates. there was work to allow custom
verification routines which we could use, but this never finished:

https://github.com/shazow/urllib3/pull/257

so right now, just treat missing socket information as if the cert was
missing.

Closes: #76
2017-10-02 20:19:25 -04:00
Graham Seaman
233e7dcf68 Allow wayback-format urls without affecting atom 'feed' urls 2017-02-09 11:43:45 +00:00
Antoine Beaupré
9d899d1dfa add --no-robots commandline flag
While this flag can be abused, it seems to me like a legitimate use
case that you want to check a fairly small document for mistakes,
which includes references to a website which has a robots.txt that
denies all robots. It turns out that most websites do *not* add a
permission for LinkCheck to use their site, and some sites, like the
Debian BTS for example, are very hostile with bots in general.

Between me using linkcheck and me using my web browser to check those
links one by one, there is not a big difference. In fact, using
linkcheck may be *better* for the website because it will use HEAD
requests instead of a GET, and will not fetch all page elements
(javascript, images, etc) which can often be fairly big.

Besides, hostile users will patch the software themselves: it took me
only a few minutes to disable the check, and a few more to make that
into a proper patch.

By forcing robots.txt without any other option, we are hurting our
good users and not keeping hostile users from doing harm.

The patch is still incomplete, but works. It lacks: documentation and
unit tests.

Closes: #508
2016-05-19 14:43:59 -04:00
Bastian Kleineidam
92c4ca9a5e Debug request headers 2014-09-20 12:16:24 +02:00
Bastian Kleineidam
029c20ed98 More python3 fixes 2014-09-12 21:59:07 +02:00
Bastian Kleineidam
35eb30432e Added some Python3 fixes. 2014-09-12 19:36:30 +02:00