Commit graph

3273 commits

Author SHA1 Message Date
Chris Mayo
4c9ec511b5 Python3: fix opening file URLs
urllib.request.urlopen() expects a string or Request object.
2019-09-12 19:58:27 +01:00
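A minimal sketch of the point made in the commit body above: in Python 3, urllib.request.urlopen() is given the file URL as a str. The helper name read_file_url and the temporary file are illustrative assumptions, not code from the repository.

```python
import pathlib
import tempfile
import urllib.request


def read_file_url(path):
    """Read a local file through a file:// URL (hypothetical helper)."""
    # as_uri() returns a str, which is one of the types urlopen() accepts.
    url = pathlib.Path(path).resolve().as_uri()
    with urllib.request.urlopen(url) as response:
        return response.read()


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello")
    tmp_path = handle.name

content = read_file_url(tmp_path)
```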
anarcat
eb2e3271a2
Merge pull request #279 from cjmayo/python3_28
{python3_28} Python3: fix robotparser
2019-09-12 08:40:18 -04:00
anarcat
8c072fa757
Merge pull request #289 from cjmayo/python3_38
{python3_38} Python3: fix linkname.py
2019-09-12 08:39:29 -04:00
Petr Dlouhý
538c4cfeb9 Python3: fix linkname.py 2019-09-11 20:32:33 +01:00
Petr Dlouhý
8a294be95f Python3: fix robotparser 2019-09-11 20:04:26 +01:00
anarcat
44944754d5
Merge pull request #286 from cjmayo/python3_35
{python3_35} Python3: fix unichr() in htmlparser
2019-09-11 09:48:35 -04:00
anarcat
2239458966
Merge pull request #285 from cjmayo/python3_34
{python3_34} fixes for Python 3: fix test_misc
2019-09-11 09:48:14 -04:00
anarcat
dbbb64cd90
Merge pull request #283 from cjmayo/python3_32
{python3_32} fixes for Python 3 + Travis test: fix threads
2019-09-11 09:47:44 -04:00
anarcat
492058a360
Merge pull request #281 from cjmayo/python3_30
{python3_30} Python3: fix decoding strings
2019-09-11 09:47:10 -04:00
anarcat
8eadc5f8a1
Merge pull request #280 from cjmayo/python3_29
{python3_29} fixes for Python 3: fix running problems in Python 3
2019-09-11 09:46:48 -04:00
Petr Dlouhý
f272206110 Python3: fix decoding strings 2019-09-10 19:52:23 +01:00
Petr Dlouhý
55a7973b93 Python3: fix csvlog 2019-09-10 19:42:26 +01:00
Petr Dlouhý
e10f25b968 fixes for Python 3: fix running problems in Python 3 2019-09-10 19:30:09 +01:00
Petr Dlouhý
d20ac0e108 Python3: fix strformat strline() 2019-09-09 19:51:30 +01:00
Petr Dlouhý
8b9f29ae52 Python3: fix unichr() in htmlparser 2019-09-09 19:51:30 +01:00
Petr Dlouhý
129a68da38 fixes for Python 3: fix test_misc 2019-09-09 19:51:30 +01:00
Petr Dlouhý
57f7ba0979 fixes for Python 3 + Travis test: fix threads 2019-09-09 19:51:30 +01:00
Marius Gedminas
60f9f80b9f Fix test_console.py on Python 3
This is an alternative fix I suggested in the comments on PR #273.
2019-09-09 18:52:29 +03:00
anarcat
4e6c806bff
Merge pull request #274 from cjmayo/python3_24
{python3_24} Python3: fix logger
2019-09-09 11:50:04 -04:00
Marius Gedminas
bb573e5eb1
Merge pull request #272 from cjmayo/python3_22
{python3_22} Python3: fix decode_parts function
2019-09-09 18:37:49 +03:00
anarcat
5c9376cfe2
Merge pull request #276 from cjmayo/python3_26
{python3_26} Python3: fix fileutil
2019-09-09 09:40:18 -04:00
Petr Dlouhý
0d7a2cac72 Python3: fix decode_parts function 2019-09-06 19:45:20 +01:00
Petr Dlouhý
9156576778 Python3: fix logger 2019-09-06 19:41:37 +01:00
Petr Dlouhý
ffb0a68ff7 Python3: fix fileurl 2019-09-05 19:41:53 +01:00
anarcat
59ab0644fd
Merge pull request #230 from cjmayo/python3_20
{python3_20} Python3: decode parts before submitting them to urllib.quote()
2019-09-04 09:48:19 -04:00
Petr Dlouhý
b5111453d8 change test_parse encoding to UTF-8 2019-07-22 19:59:37 +01:00
Petr Dlouhý
d6d48b4814 html parser: use name instead of peeking 2019-07-22 19:59:37 +01:00
Petr Dlouhý
51a06d8a1e Remove home-cooked htmlparser and use BeautifulSoup 2019-07-22 19:59:37 +01:00
Nick Muerdter
fb3f65cdcc
Fix CSV output containing increasing number of null byte characters.
The CSV buffer is being truncated on each new row, but since the
stream's pointer isn't also being reset, each new row starts at the same
position as the previous row, but with null bytes up until that point.
This leads to increasing growth in the length of each CSV row, since
each line will be padded with null bytes equivalent to the previous
row's length.
2019-05-31 18:52:57 -06:00
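The failure described above can be reproduced with a small sketch. It assumes an io.StringIO buffer shared between rows; the function name format_row and the reset_position switch are hypothetical and only exist to show the buggy and the fixed behaviour side by side.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)


def format_row(row, reset_position):
    """Write one row into the shared buffer and return its text."""
    writer.writerow(row)
    text = buffer.getvalue()
    buffer.truncate(0)  # drops the content but leaves the stream position alone
    if reset_position:
        buffer.seek(0)  # without this, the next write is padded with NUL characters
    return text


# Buggy variant: truncate only, so the second row starts after a run of NULs.
buggy_first = format_row(["a", "b"], reset_position=False)
buggy_second = format_row(["c", "d"], reset_position=False)

buffer.seek(0)  # return the buffer to a clean state before the fixed variant

# Fixed variant: truncate and rewind, so every row starts at position 0.
fixed_first = format_row(["a", "b"], reset_position=True)
fixed_second = format_row(["c", "d"], reset_position=True)
```

In the buggy variant the padding grows with every row, because the stream position keeps advancing while the content is thrown away each time.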
Petr Dlouhý
a6643034fb Python3: decode parts before submitting them to urllib.quote() 2019-05-10 20:06:01 +01:00
Chris Mayo
1c2e6c465e squash! Python3: fix strformat ascii_safe() and unicode_safe() 2019-05-10 08:58:52 -04:00
Petr Dlouhý
ac14585a78 Python3: fix strformat for test_file 2019-05-10 08:58:52 -04:00
Petr Dlouhý
acaf8e671e Python3: fix strformat unicode_safe() 2019-05-10 08:58:52 -04:00
Petr Dlouhý
e11ba8e427 squash! Python3: fix strformat ascii_safe() and unicode_safe()
From:
fixes for Python 3: fix running problems in Python 3
2019-05-10 08:58:52 -04:00
Petr Dlouhý
a1c6c4935e Python3: fix strformat ascii_safe() and unicode_safe() 2019-05-10 08:58:52 -04:00
anarcat
9c9706a07a
Merge pull request #256 from cjmayo/parse_qs
Replace deprecated cgi.parse_qs
2019-04-27 13:27:19 -04:00
Chris Mayo
a355476b82 Replace deprecated regexp flags not at start
DeprecationWarning: Flags not at the start of the expression
2019-04-26 19:25:59 +01:00
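As an illustration of the warning quoted above, a pattern such as "link(?i)" places a global inline flag after other pattern text, which newer Python versions reject outright. The sketch below shows two accepted spellings; the pattern itself is a made-up example, not one taken from the repository.

```python
import re

# Inline flag at the very start of the pattern.
inline_start = re.compile(r"(?i)link")

# Equivalent spelling that keeps the flag out of the pattern text entirely.
flag_argument = re.compile(r"link", re.IGNORECASE)

matches_inline = bool(inline_start.search("LinkChecker"))
matches_argument = bool(flag_argument.search("LinkChecker"))
```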
Chris Mayo
5ae40c1ae2 Replace deprecated cgi.parse_qs 2019-04-26 19:23:45 +01:00
anarcat
59fe9ed876
Merge pull request #228 from cjmayo/python3_18
{python3_18} Python3: fix unicode in urlbase
2019-04-25 16:17:00 -04:00
anarcat
70f0bbf225
Merge pull request #250 from cjmayo/ftpserver
Get FtpServerTest working by updating to current pyftpdlib API
2019-04-25 16:16:33 -04:00
Petr Dlouhý
e92b0a9f7b Python3: fix unicode in urlbase 2019-04-25 19:57:45 +01:00
Petr Dlouhý
b3881ce3b5 Python3: fix urlbase, strformat and others 2019-04-25 19:57:45 +01:00
anarcat
056ba1d717
Merge pull request #248 from cjmayo/donateurl
Remove configuration.DonateUrl
2019-04-24 10:59:50 -04:00
anarcat
b656346352
Merge pull request #246 from cjmayo/locale_format
Replace deprecated locale.format()
2019-04-24 10:59:17 -04:00
anarcat
a42bc14fc2
Merge pull request #243 from cjmayo/warning
Replace deprecated log.warn
2019-04-24 10:58:31 -04:00
anarcat
bb0a1e1992
Merge pull request #242 from cjmayo/wummel
Update references to GitHub project from wummel to linkchecker
2019-04-24 10:58:15 -04:00
anarcat
ee8667e1ca
Merge pull request #229 from cjmayo/python3_19
{python3_19} Python3: fix unicode in fileurl
2019-04-24 10:57:45 -04:00
anarcat
492da5aee0
Merge pull request #227 from cjmayo/python3_17
{python3_17} Python3: fix unicode in url.py
2019-04-24 10:57:09 -04:00
Chris Mayo
f60810b050 Fix Python 3 "TypeError: decoding str is not supported" in FtpUrl.cwd 2019-04-22 19:34:46 +01:00
Chris Mayo
20e11f1b1f Remove configuration.DonateUrl 2019-04-21 19:44:18 +01:00
Chris Mayo
ce1dd55d7a Replace deprecated locale.format()
locale.format_string() was introduced in Python 2.5.
2019-04-21 19:28:54 +01:00
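A short sketch of the replacement call named in the commit above. The format strings and values are illustrative assumptions; the "C" locale is selected only so the result does not depend on the machine running the example.

```python
import locale

# Pin the locale so the output is predictable for this example.
locale.setlocale(locale.LC_ALL, "C")

# format_string() takes a %-style format and applies locale-aware number formatting.
formatted = locale.format_string("%.2f", 3.14159)

# grouping=True inserts the locale's thousands separator; the "C" locale has none.
grouped = locale.format_string("%d", 1234567, grouping=True)
```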
Petr Dlouhý
b40f4722c7 Python3: fix unicode in fileurl 2019-04-19 20:42:38 +01:00
Petr Dlouhý
f4b73c6d42 Python3: fix unicode in url.py 2019-04-19 19:57:25 +01:00
Chris Mayo
46179f681c Replace deprecated log.warn
warning() has been the documented method since logging was introduced in
Python 2.3.
2019-04-18 20:10:03 +01:00
EsuS
004632a99b Update references to GitHub project from wummel to linkchecker
Remove all mention of donations.
2019-04-18 19:59:52 +01:00
Petr Dlouhý
bc99dc51de Python3: fix HtmlParser 2019-04-18 19:35:16 +01:00
Petr Dlouhý
2c6411d68e Python3: fix regexp format 2019-04-17 19:50:06 +01:00
Petr Dlouhý
8f4acc3168 Python3: use str and basestring from builtins 2019-04-16 20:08:29 +01:00
anarcat
e93d18d6e9
Merge pull request #232 from cjmayo/gzip2
Remove leftovers from introduction of requests
2019-04-15 10:31:06 -04:00
Petr Dlouhý
2985e9ae65 Use Python 3 compatible octal masks 2019-04-13 20:37:39 +01:00
Chris Mayo
ff4a2e496e Remove unused copy of gzip2
Not used since requests introduced in 7b34be590b.
2019-04-13 20:35:37 +01:00
anarcat
75626d456a
Merge pull request #217 from cjmayo/python3_07
{python3_07} Python3: use BytesIO instead of StringIO
2019-04-11 11:48:45 -04:00
anarcat
8223acd44e
Merge pull request #226 from cjmayo/python3_16
{python3_16} Python3: fix parsepdf
2019-04-11 11:47:57 -04:00
anarcat
2bdd155d56
Merge pull request #231 from cjmayo/python3_21
{python3_21} fix urllib imports
2019-04-11 11:47:50 -04:00
anarcat
ce76b7c82d
Merge pull request #222 from cjmayo/python3_12
{python3_12} Python3: fix bytes mark in parser/__init__.py
2019-04-11 11:46:41 -04:00
Petr Dlouhý
106d58c2da Python3: use BytesIO instead of StringIO 2019-04-09 20:09:35 +01:00
Petr Dlouhý
79e05d1511 Python3: fix parsepdf 2019-04-09 20:09:35 +01:00
Petr Dlouhý
4acabf5cb5 fix urllib imports 2019-04-09 20:09:35 +01:00
Petr Dlouhý
aec8243348 Python3: fix bytes mark in parser/__init__.py 2019-04-09 20:09:35 +01:00
Petr Dlouhý
033f9fbdb3 Python3: mark bytes explicitly 2019-04-09 20:09:35 +01:00
Yaroslav Halchenko
7ed7919692 RF: place parser.flush() under mutex as well
Just a safety measure, not yet proven to be required but overall
makes sense
2018-11-06 10:58:10 -05:00
Yaroslav Halchenko
ee27e178ec BF: place a mutex around apparently thread-unsafe parser.feed invocation
That leads to fix up of anchors analysis and probably other issues
such as floating number of found urls etc
2018-11-01 11:10:01 -04:00
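The locking approach described in this commit (and extended to parser.flush() in the later one) might look roughly like the sketch below. SharedParser is a hypothetical stand-in for a parser that is not thread-safe; the real parser class and call sites are not shown in this log.

```python
import threading


class SharedParser:
    """Hypothetical stand-in for a parser that is not thread-safe."""

    def __init__(self):
        self.chunks = []

    def feed(self, data):
        self.chunks.append(data)

    def flush(self):
        joined = "".join(self.chunks)
        self.chunks.clear()
        return joined


parser = SharedParser()
parser_lock = threading.Lock()
results = []


def parse(data):
    # Hold the lock across feed() and flush() so that two threads cannot
    # interleave their input inside the shared parser.
    with parser_lock:
        parser.feed(data)
        results.append(parser.flush())


threads = [threading.Thread(target=parse, args=(doc,)) for doc in ("<a>", "<b>", "<c>")]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Each thread gets back exactly the document it fed in, in whatever order the threads happen to run.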
Yaroslav Halchenko
b78c2d200e DOC: minor typo fix 2018-11-01 11:08:09 -04:00
gerdneuman
de6a82b378
Added whatsapp:// to ignored protocols
Fixes https://github.com/wummel/linkchecker/issues/595
2018-08-09 13:49:15 +02:00
regexaurus
50a9ff65b8 Updated support (issues) URL 2018-08-03 00:53:47 -04:00
Marius Gedminas
6f55f446ae Load cookies from the --cookiefile correctly
requests.cookies.merge_cookies() requires a dict or a CookieJar as the second argument.
We've been passing lists of Cookie objects instead.

Fixes #62, harder this time.
2018-03-16 13:23:26 +02:00
Marius Gedminas
6becc08284 Fix internal error when using cookies
There was some kind of confusion between a module and a function argument,
introduced in commit 90257a1b5e.

Fixes #62.
2018-03-15 23:30:41 +02:00
Petr Dlouhý
e615480850 Python3: fix reading Safari bookmarks 2018-01-19 09:52:43 +01:00
Petr Dlouhý
256202a20b fixes for Python 3: fix proxysupport 2018-01-19 09:52:43 +01:00
Petr Dlouhý
f128c9c168 Python3: fix gzip2 format 2018-01-19 09:52:43 +01:00
Petr Dlouhý
a1b300c892 Python3: fix imports 2018-01-19 09:52:43 +01:00
Petr Dlouhý
0a13fae3b4 remove third party packages and use them as dependency 2018-01-09 23:25:27 +01:00
Petr Dlouhý
2daf685633 Python3: fix few htmllib problems 2018-01-05 22:48:46 +01:00
Petr Dlouhý
fb39a4116f Python3: fix fileutil 2018-01-05 20:31:21 +01:00
Reinhold Füreder
e864bbdabf
Use os.makedirs(...) instead of os.mkdir(...) 2018-01-03 11:33:53 +01:00
Philipp Hahn
1368643a50 Fix fragment identifier quoting
According to <https://tools.ietf.org/html/rfc3986>:
 fragment    = *( pchar / "/" / "?" )
 pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
 unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
 pct-encoded = "%" HEXDIG HEXDIG
 sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Fixes #96
2017-11-10 08:03:03 -05:00
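The grammar quoted above can be turned into a quoting rule with the standard library. This is a sketch under assumptions, not the code from the commit: the names FRAGMENT_SAFE and quote_fragment are hypothetical, and "%" is always escaped here, so input that is already percent-encoded would be encoded twice.

```python
from urllib.parse import quote

# Characters the fragment rule allows unescaped: sub-delims, ":", "@", "/", "?"
# and "~". Letters, digits and "-", ".", "_" are never escaped by quote().
FRAGMENT_SAFE = "!$&'()*+,;=:@/?~"


def quote_fragment(fragment):
    """Percent-encode everything RFC 3986 does not allow in a fragment."""
    return quote(fragment, safe=FRAGMENT_SAFE)


kept = quote_fragment("section/1?x=(a,b)")  # every character is already allowed
escaped = quote_fragment("two words#")      # the space and "#" must be encoded
```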
Antoine Beaupré
71be9b941b
fix incorrect call to the logging module (Closes: #847208) 2017-11-03 09:47:01 -04:00
Félix Sipma
c8d9038ae8 improve get_plugin_folders() docstring 2017-10-18 15:58:18 +02:00
Félix Sipma
deca8c667e introduce linkcheck.configuration.get_user_data() 2017-10-18 15:55:55 +02:00
Félix Sipma
a03e2e4ada use xdg dirs for config & data
~/.linkchecker is used instead of the xdg equivalents if the directory
exists (backward compatibility).
2017-10-17 18:48:07 +02:00
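The fallback rule in this commit can be sketched as follows. The function name get_config_dir, its parameters and the "linkchecker" subdirectory name are assumptions made for illustration; only the configuration directory is covered, and the default of ~/.config when XDG_CONFIG_HOME is unset or empty follows the XDG Base Directory specification.

```python
import os
import tempfile


def get_config_dir(home, environ):
    """Pick the configuration directory (hypothetical helper)."""
    legacy = os.path.join(home, ".linkchecker")
    if os.path.isdir(legacy):
        # Backward compatibility: an existing legacy directory wins.
        return legacy
    xdg_base = environ.get("XDG_CONFIG_HOME") or os.path.join(home, ".config")
    return os.path.join(xdg_base, "linkchecker")


with tempfile.TemporaryDirectory() as home:
    # No legacy directory yet: the XDG location is chosen.
    fresh_choice = get_config_dir(home, {})
    expected_fresh = os.path.join(home, ".config", "linkchecker")

    # Once ~/.linkchecker exists, it takes precedence.
    os.mkdir(os.path.join(home, ".linkchecker"))
    legacy_choice = get_config_dir(home, {})
    expected_legacy = os.path.join(home, ".linkchecker")
```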
Antoine Beaupré
9b12b5d66f
workaround new limitation in requests
newer requests do not expose the internal SSL socket object so we
cannot verify certificates. there was work to allow custom
verification routines which we could use, but this never finished:

https://github.com/shazow/urllib3/pull/257

so right now, just treat missing socket information as if the cert was
missing.

Closes: #76
2017-10-02 20:19:25 -04:00
Marius Gedminas
4a092c218c Whitespace bigotry 2017-03-14 17:18:27 +02:00
anarcat
5471b63ceb Merge pull request #39 from PetrDlouhy/fix/cache
Fix cache: Don't check one url multiple times
2017-03-14 09:26:07 -04:00
Marius Gedminas
fb1debaa68 Fix incompatible pointer type warnings
The warnings looked like this:

    htmlparse.c: In function ‘yyparse’:
    htmlparse.c:1810:18: warning: passing argument 1 of ‘yyerror’ from incompatible pointer type [-Wincompatible-pointer-types]
    htmlparse.y:40:13: note: expected ‘PyObject ** {aka struct _object **}’ but argument is of type ‘PyObject * {aka struct _object *}’
    htmlparse.c:1927:12: warning: passing argument 1 of ‘yyerror’ from incompatible pointer type [-Wincompatible-pointer-types]
    htmlparse.y:40:13: note: expected ‘PyObject ** {aka struct _object **}’ but argument is of type ‘PyObject * {aka struct _object *}’

The argument is not used, so it doesn't really matter what pointer type
it is.
2017-02-24 15:04:09 +02:00
Petr Dlouhý
eaa538c814 don't check one url multiple times 2017-02-14 10:23:25 +01:00
Marius Gedminas
03dfe3d3a1 Fix "operation on ... may be undefined" [-Wsequence-point] warnings
Fixes a bunch of warnings like

  htmlparse.y:509:25: warning: operation on ‘self->userData->buf’ may be undefined [-Wsequence-point]
  htmlparse.y:518:29: warning: operation on ‘self->userData->tmp_buf’ may be undefined [-Wsequence-point]

which were a result of (macro-expanded) code like this (simplified):

  if ((tmp = (tmp = PyMem_Realloc(...))) == NULL) return NULL;

The PyMem_Resize(p, ...) macro assigns the new value to p before
returning it, so there's no need to assign it again.

See http://bugs.python.org/issue1668036 for evidence (from 2007) that
this is indeed a documented side-effect of the macro API.
2017-02-13 15:20:33 +02:00
Graham Seaman
233e7dcf68 Allow wayback-format urls without affecting atom 'feed' urls 2017-02-09 11:43:45 +00:00
Marius Gedminas
743a5f31cb Crawl HTML attributes in deterministic order
Fixes #17.
2017-02-01 19:19:53 +02:00
Graham Seaman
2e32780dc7 Force header names to lower to allow for CaseInsensitiveDict variability 2017-02-01 16:28:07 +00:00
Marius Gedminas
3c99b6aa30 Fix TypeError: hasattr(): attribute name must be string
The one test failure in Travis happens in
TestConsole.test_internal_error, but only if you have the argcomplete
package installed.

This was a real bug in error reporting code.
2017-02-01 16:02:35 +02:00
Antoine Beaupré
d51b7f34b6 Merge branch '9.3.x' 2017-01-31 19:21:22 -05:00
Antoine Beaupré
da8cecd83c Merge remote-tracking branch 'anarcat/norobots' 2017-01-31 11:34:09 -05:00
Antoine Beaupré
bf45fb1884 fix HTTPS URL checks
in Debian Jessie, linkchecker fails because of an API problem.

it completely breaks HTTPS checks.

this patch fixes the problem

from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=772947
2017-01-31 11:25:45 -05:00
Bastian Kleineidam
1e291afdfa Fix python requests version check 2017-01-31 11:25:38 -05:00
Antoine Beaupré
46d96d0aa0 fix HTTPS URL checks
in Debian Jessie, linkchecker fails because of an API problem.

it completely breaks HTTPS checks.

this patch fixes the problem

from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=772947
2016-09-30 11:20:38 -04:00
Bastian Kleineidam
c2ce810c3f Fix python requests version check 2016-06-28 21:55:10 +02:00
Antoine Beaupré
9d899d1dfa add --no-robots commandline flag
While this flag can be abused, it seems to me like a legitimate use
case that you want to check a fairly small document for mistakes,
which includes references to a website which has a robots.txt that
denies all robots. It turns out that most websites do *not* add a
permission for LinkCheck to use their site, and some sites, like the
Debian BTS for example, are very hostile with bots in general.

Between me using linkcheck and me using my web browser to check those
links one by one, there is not a big difference. In fact, using
linkcheck may be *better* for the website because it will use HEAD
requests instead of a GET, and will not fetch all page elements
(javascript, images, etc) which can often be fairly big.

Besides, hostile users will patch the software themselves: it took me
only a few minutes to disable the check, and a few more to make that
into a proper patch.

By forcing robots.txt without any other option, we are hurting our
good users and not keeping hostile users from doing harm.

The patch is still incomplete, but works. It lacks: documentation and
unit tests.

Closes: #508
2016-05-19 14:43:59 -04:00
Bastian Kleineidam
0ef00eea56 Move GUI files to separate project 2016-01-23 13:28:15 +01:00
Bastian Kleineidam
549533d701 Improved debugging 2016-01-19 21:55:50 +01:00
wummel
a40c39be59 Merge pull request #560 from xvadim/feature
Added plugin for parsing and checking links in Markdown files
2016-01-19 07:30:34 +01:00
wummel
e2556abbb6 Merge pull request #561 from nbigaouette/issue555
Detect if "url_data" contains proxy attributes before using them.
2016-01-17 21:59:35 +01:00
Bastian Kleineidam
3d711666e1 Fix parser for changes in bison 3.0.x 2015-11-26 12:33:44 +01:00
Nicolas Bigaouette
4e56eceb35 Detect if "url_data" contains proxy attributes before using them.
Fix proposed by @colwilson in issue #555.
2014-11-12 09:58:30 -05:00
Vadim Khohlov
d4352fc828 Added plugin for parsing and checking links in Markdown files 2014-11-11 15:35:18 +02:00
Bastian Kleineidam
27937e6f83 Fix requests module version check. 2014-09-22 22:45:04 +02:00
Bastian Kleineidam
228bce1ba2 Add to instead of replace the HTTP client headers. 2014-09-20 12:17:42 +02:00
Bastian Kleineidam
92c4ca9a5e Debug request headers 2014-09-20 12:16:24 +02:00
Bastian Kleineidam
029c20ed98 More python3 fixes 2014-09-12 21:59:07 +02:00
Bastian Kleineidam
35eb30432e Added some Python3 fixes. 2014-09-12 19:36:30 +02:00
Bastian Kleineidam
697e7b82e1 Search for system certs 2014-09-11 21:19:49 +02:00
Bastian Kleineidam
21c7200360 Reactivate paging of help pages. 2014-09-11 19:42:42 +02:00
Bastian Kleineidam
06c6b80ed3 Fix proxy support. 2014-09-05 22:48:10 +02:00
wummel
6580d37dc9 Merge pull request #545 from ArloL/patch-1
Use correct attribute
2014-09-05 21:13:40 +02:00
Bastian Kleineidam
ee4545399d Support itms-services: URLs. #532 2014-09-05 21:06:10 +02:00
Bastian Kleineidam
37d4ed6f83 Add hyphen and dot to the allowed scheme characters. 2014-09-05 20:59:54 +02:00
Bastian Kleineidam
c8df9355f0 Try to use the SSL certs from the certifi package. 2014-09-05 20:00:30 +02:00
Bastian Kleineidam
c684918ba6 Ignore urllib3 warnings about invalid SSL certs since we check them ourselves. 2014-09-05 20:00:00 +02:00
Bastian Kleineidam
2354f16dbb Catch urllib3 errors. 2014-09-05 19:59:28 +02:00
Arlo Louis O'Keeffe
52337f82cb Use correct attribute 2014-09-03 09:36:22 +02:00
Bastian Kleineidam
85dadc1f1a Add documentation 2014-07-16 07:37:19 +02:00
Bastian Kleineidam
37664ea8a4 Fix Word file check plugin. 2014-07-15 22:39:41 +02:00
Bastian Kleineidam
b646293fd6 Remove unused import. 2014-07-15 22:38:57 +02:00
Bastian Kleineidam
29193bbcc9 Fix login URL cookies and don't sanitize after config reading. 2014-07-15 22:23:38 +02:00
Bastian Kleineidam
032c4091c3 Some easy python3 compatibility changes. 2014-07-15 18:40:47 +02:00
Bastian Kleineidam
90257a1b5e Replace twill with custom code. 2014-07-15 18:37:05 +02:00
Bastian Kleineidam
a665d35feb Use proxies and checker session in robots.txt. 2014-07-14 20:28:28 +02:00
Bastian Kleineidam
266e9e189f Further code cleanup. 2014-07-14 20:14:00 +02:00
Bastian Kleineidam
6c38b4165a Use given HTTP auth data for robots.txt fetching. 2014-07-14 19:50:11 +02:00
Bastian Kleineidam
7838521b6e Code cleanup. 2014-07-14 19:49:01 +02:00
Bastian Kleineidam
100ce11d40 Sanitize CGI configuration. 2014-07-13 21:56:01 +02:00
Bastian Kleineidam
eafa1ed2da Updated unknown URL schemes. 2014-07-13 21:51:53 +02:00
Bastian Kleineidam
176b95a30e Do not strip quotes from resolved URLs. 2014-07-11 00:43:46 +02:00
Bastian Kleineidam
27702ddbac Catch log output start errors. 2014-07-09 21:54:47 +02:00
Bastian Kleineidam
6ff89e9e8c Fix GUI startup 2014-07-06 20:20:03 +02:00
Bastian Kleineidam
0fa7ed2699 Fix empty URL handling. 2014-07-03 23:34:40 +02:00
Bastian Kleineidam
1590ab6240 cleanup 2014-07-01 21:12:47 +02:00
Bastian Kleineidam
9a124513e3 Merge branch 'master' of github.com:wummel/linkchecker 2014-07-01 21:11:33 +02:00
wummel
9bb3852edf Merge pull request #515 from Mark-Hetherington/extern-redirect
When following redirections update url.extern
2014-07-01 21:11:13 +02:00
Bastian Kleineidam
12cc12db53 Add get_redirects() function. 2014-07-01 21:11:06 +02:00
Bastian Kleineidam
cde261c009 Parse Refresh: and Content-Location: header values for URLs. 2014-07-01 20:16:43 +02:00
Bastian Kleineidam
c3ec91ac6d Fix intern URL search pattern. 2014-06-13 23:52:21 +02:00
Bastian Kleineidam
ad8eb424f3 Merge Mark-Hetherington-xml-parse-warn with slight modifications. 2014-06-13 20:50:37 +02:00
Mark Hetherington
34d83db29c When following redirections update url.extern 2014-05-19 14:59:58 +10:00
Bastian Kleineidam
eaa8a963ec Refactor logging configuration. 2014-05-10 21:23:06 +02:00
Bastian Kleineidam
4b28e6e860 Move mime stuff into own submodule. 2014-05-10 21:22:10 +02:00
Bastian Kleineidam
9b794b936c Print interrupt note in text output. 2014-04-30 20:17:33 +02:00
Bastian Kleineidam
43c2e6641b Logging refactor, interrupt and abort flags added. 2014-04-30 19:59:43 +02:00
Bastian Kleineidam
b152ce7a6e Add PDF test and fix page number. 2014-04-29 18:53:24 +02:00
Bastian Kleineidam
0d9881cf03 Fix add_url() with local files. 2014-04-29 18:43:21 +02:00
Bastian Kleineidam
82dd76b0d7 Add PDF link parsing. 2014-04-28 18:13:45 +02:00
Bastian Kleineidam
0ffdea2b8d Added parser plugins and the applies_to() function. 2014-04-28 18:11:19 +02:00
Bastian Kleineidam
0f8ee234c3 Fix documentation. 2014-04-28 18:10:20 +02:00
Bastian Kleineidam
6bae3e0f49 Use the same request arguments for redirects. 2014-04-23 22:03:44 +02:00
Bastian Kleineidam
981079c041 Support itemtype attribute parsing. 2014-04-23 22:03:20 +02:00
Bastian Kleineidam
4232b69633 Support <img> srcset attribute parsing. 2014-04-10 17:51:59 +02:00
Bastian Kleineidam
6caf654031 Parse Link: headers. 2014-04-10 17:50:55 +02:00
Bastian Kleineidam
22caa9367a Refactor recursion checks. 2014-04-10 17:50:55 +02:00
Bastian Kleineidam
08fbd891ef Do not check external robots.txt sitemaps. 2014-04-09 19:44:29 +02:00
Bastian Kleineidam
c57f607fc3 Use urldata.add_url() 2014-04-07 18:54:33 +02:00
Bastian Kleineidam
9c5693ad41 Add doc and copyright. 2014-03-30 19:23:42 +02:00
Bastian Kleineidam
4759cee377 Updated mailto: documentation. 2014-03-30 08:30:14 +02:00
Bastian Kleineidam
b6b5c7a12e Simpler link parsing routine. 2014-03-27 19:49:17 +01:00
Bastian Kleineidam
f180592cc4 Increase thread poll interval to reduce CPU usage. 2014-03-27 17:43:14 +01:00
Bastian Kleineidam
81da2eb48f Code cleanup 2014-03-27 17:19:52 +01:00
Bastian Kleineidam
da0ef8e8ea Fix for moved functions. 2014-03-27 17:19:24 +01:00
Bastian Kleineidam
fa26876f67 Don't use encoding detection since it's very slow. 2014-03-27 12:27:11 +01:00
Bastian Kleineidam
8cf84be2e2 Fix pyopenssl certificate date parsing. 2014-03-26 20:25:44 +01:00
Bastian Kleineidam
49df359317 Some fixes when pyopenssl is used instead of python ssl module. 2014-03-26 19:59:17 +01:00
Bastian Kleineidam
dec0f6c8dc Fix error with SNI checks 2014-03-26 12:38:16 +01:00
Bastian Kleineidam
a8623bc0bc Display SSL info on redirects. 2014-03-26 07:16:03 +01:00
Bastian Kleineidam
be59802569 Set http connection charset. 2014-03-20 21:20:34 +01:00
Bastian Kleineidam
098dede12c Fix warningregex setting in GUI. 2014-03-20 20:46:58 +01:00
Bastian Kleineidam
9cd67dfcb2 More SSL message work. 2014-03-20 20:24:57 +01:00
Bastian Kleineidam
4c76345338 Add certificate valid date info and always set verify flag. 2014-03-19 17:16:42 +01:00
Bastian Kleineidam
9a7ad3a84f Print SSL cipher info for https URLs. 2014-03-19 17:02:34 +01:00
Bastian Kleineidam
931ca4f402 Add missing log keyword arg. 2014-03-19 17:02:00 +01:00
Bastian Kleineidam
71a7898ee6 Don't check non-connected URLs. 2014-03-19 16:33:38 +01:00
Bastian Kleineidam
ce733ae76b Don't check for robots.txt directives in local html files. 2014-03-19 16:33:22 +01:00
Bastian Kleineidam
e528d5f7db Fix ssl connection handling and change plugin type to connection plugin. 2014-03-19 14:28:33 +01:00
Bastian Kleineidam
9be667b52a Do not warn about missing addresses on mailto links that have subjects. 2014-03-18 23:27:59 +01:00
Bastian Kleineidam
2eb6b1b44c Call connect() on unconnected ssl responses. 2014-03-18 23:27:21 +01:00
Bastian Kleineidam
fc73c6ca6e Log number of checked unique URLs. 2014-03-14 23:46:17 +01:00
Bastian Kleineidam
91c6e1d29f Don't log bytes in status. 2014-03-14 22:25:19 +01:00
Bastian Kleineidam
34bdf5c75a Updated copyright and docs. 2014-03-14 22:09:05 +01:00
Bastian Kleineidam
19b8baf08c Move cached queue items to top once in a while. 2014-03-14 22:08:51 +01:00
Bastian Kleineidam
6437f08277 Display downloaded bytes. 2014-03-14 21:06:10 +01:00
Bastian Kleineidam
c51caf1133 Assertions should be earlier. 2014-03-14 20:26:11 +01:00
Bastian Kleineidam
cc401923ac Improve wording of status message. 2014-03-14 20:25:37 +01:00
Bastian Kleineidam
cfff4c4a84 Disable URL length warning for data: URLs. 2014-03-14 20:24:28 +01:00
Bastian Kleineidam
ac78c6d5b8 Internal errors do not stop the checking thread any more. 2014-03-14 20:23:04 +01:00
Bastian Kleineidam
b18854649d Count unique URLs for url queue limit. 2014-03-14 20:21:46 +01:00
Bastian Kleineidam
257644e660 Add cache length function to get number of cached elements. 2014-03-14 20:19:34 +01:00
Bastian Kleineidam
306979abca Add HttpHeaderInfo plugin 2014-03-12 19:28:37 +01:00
Bastian Kleineidam
279db5c5b8 Fix documentation. 2014-03-12 19:22:18 +01:00
Bastian Kleineidam
ccd0d4ead7 Updated the list of unknown or ignored URI schemes. 2014-03-12 19:20:49 +01:00
Bastian Kleineidam
121602df87 Use SSL cert on Windows systems. 2014-03-11 20:58:16 +01:00
Bastian Kleineidam
0ad5969b54 Simplify config dir functions. 2014-03-11 20:23:49 +01:00
Bastian Kleineidam
41d07729bb Install certificate store with installers. 2014-03-10 22:34:37 +01:00
Bastian Kleineidam
ee0717131d Add marker for http debugging 2014-03-10 20:09:05 +01:00
Bastian Kleineidam
9c9cf0c3e2 Check for Python requests >= 2.2.0 2014-03-10 19:31:31 +01:00
Bastian Kleineidam
57edf0923e Updated copyright year 2014-03-10 19:27:22 +01:00
Bastian Kleineidam
bca226c293 Fix assertion checking external links; fix tests 2014-03-10 18:23:44 +01:00
Bastian Kleineidam
40b663cf9e Ignore URLs earlier. 2014-03-10 18:05:11 +01:00
Bastian Kleineidam
6b334dc79b Fix URL result caching. 2014-03-08 19:35:10 +01:00
Bastian Kleineidam
0113f06406 Enable arbitrary output encodings in CSV output. See #467 2014-03-06 22:40:52 +01:00
Bastian Kleineidam
102837b875 Set maximum redirects 2014-03-06 21:58:35 +01:00
Bastian Kleineidam
fab2c2da98 Improve content type setting. 2014-03-05 20:12:19 +01:00
Bastian Kleineidam
ef13a3fce1 Implement sitemap and sitemap index parsing. 2014-03-05 19:26:37 +01:00
Bastian Kleineidam
b72cf252fb Move parseable check down since it might get the content. 2014-03-05 19:26:05 +01:00
Bastian Kleineidam
9ef65cb774 Fix UrlData string representation. 2014-03-05 19:25:40 +01:00
Bastian Kleineidam
00bd549c0c Remove duplicate content type map. 2014-03-05 19:24:58 +01:00
Bastian Kleineidam
380f14453b Fix mimetype guessing from content. 2014-03-05 19:23:58 +01:00
Bastian Kleineidam
192cfab009 Cleanup of the UrlData.is_* functions 2014-03-05 19:23:16 +01:00
Bastian Kleineidam
b17211f162 Set for release. 2014-03-04 21:36:24 +01:00
Bastian Kleineidam
978b24f2d7 Merge branch 'caching' 2014-03-04 07:21:42 +01:00
Bastian Kleineidam
f1076c8813 Increase url-too-long warning. 2014-03-03 23:31:04 +01:00
Bastian Kleineidam
82f81241fd Check all links and add better caching. 2014-03-03 23:29:45 +01:00
Bastian Kleineidam
510af337c1 Improved --version output. 2014-03-01 21:00:16 +01:00
Bastian Kleineidam
74d804ac82 Print release date on --version and internal errors. 2014-03-01 20:59:00 +01:00
Bastian Kleineidam
39df1812c7 Default to 10 threads instead of 100. 2014-03-01 20:49:06 +01:00
Bastian Kleineidam
6f205a2574 Support checking Sitemap: URLs in robots.txt files. 2014-03-01 20:25:19 +01:00
Bastian Kleineidam
0f0d79c7e0 Remove crawl-delay stuff 2014-03-01 20:01:42 +01:00
Bastian Kleineidam
00f8011709 Catch overflowerror in robots.txt crawl-delay 2014-03-01 19:58:22 +01:00
Bastian Kleineidam
0e4d6f6e1a Parse sitemap urls in robots.txt files. 2014-03-01 19:57:57 +01:00
Bastian Kleineidam
78a99717fe Check regular expressions from users for errors. 2014-03-01 19:15:48 +01:00
Bastian Kleineidam
c20005a031 Add missing docstring. 2014-03-01 19:14:43 +01:00
Bastian Kleineidam
39c39b1d9f Disable twill page refresh. 2014-03-01 18:19:29 +01:00
Bastian Kleineidam
0211529d79 Use twill form field number if all else fails. 2014-03-01 18:12:06 +01:00
Bastian Kleineidam
7d84e1e729 Do not check permissions on non-posix systems for now. 2014-03-01 18:01:08 +01:00
Bastian Kleineidam
eb7e52c0e2 -o none sets exit code now 2014-03-01 15:31:39 +01:00
Bastian Kleineidam
f7f5001256 Add missing column name to SQL insert statement. 2014-03-01 12:03:33 +01:00
Bastian Kleineidam
f9bf831804 Remove some empty lines 2014-03-01 12:02:00 +01:00
Bastian Kleineidam
900e04ceda Dynamic language switch in the GUI. 2014-03-01 12:01:47 +01:00
Bastian Kleineidam
9d0255e156 Fix bookmark imports 2014-03-01 10:16:29 +01:00
Bastian Kleineidam
7b34be590b Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements. 2014-03-01 00:12:34 +01:00
Bastian Kleineidam
c806be5c15 Updated copyright 2014-01-08 22:33:04 +01:00
Bastian Kleineidam
c076e312a2 Send an Accept header. 2014-01-08 19:56:00 +01:00
Bastian Kleineidam
f3b435c2a6 Add missing docstrings. 2013-12-24 07:15:31 +01:00
Bastian Kleineidam
e0a2558b2b Updated copyright. 2013-12-24 07:13:16 +01:00
Bastian Kleineidam
845a6a1146 Fix loader in frozen executables. 2013-12-18 20:53:17 +01:00
wummel
9646f0b652 Merge pull request #418 from chuckbjones/reset-url-on-fallback
Reset to original url when falling back to GET
2013-12-17 22:37:17 -08:00
Bastian Kleineidam
fbbced4d8f Fix tests 2013-12-13 07:39:59 +01:00
Bastian Kleineidam
5151e68a3e Fix logger config 2013-12-13 07:37:21 +01:00
Bastian Kleineidam
103e00b4d1 Allow disabling of ssl certificate checks. 2013-12-12 22:17:57 +01:00
Bastian Kleineidam
39fb02f9a9 Remember last save result as filetype. 2013-12-12 20:44:09 +01:00
Bastian Kleineidam
5736987b60 Refactor output loggers. 2013-12-11 18:41:55 +01:00
Bastian Kleineidam
78ed1e9e52 Do not GET on POST forms. 2013-12-10 23:42:43 +01:00
Bastian Kleineidam
0ca63797bf Remove content cache. 2013-12-10 23:41:52 +01:00
Bastian Kleineidam
a7c1cdd6f6 Check for help files. 2013-12-10 20:56:26 +01:00
Bastian Kleineidam
2c5ede2eb7 Fallback to GET for Apache Coyote servers. 2013-12-08 08:22:56 +01:00
Bastian Kleineidam
b567f766ba Fix strtime test. 2013-12-06 07:13:44 +01:00
Bastian Kleineidam
6d68e00068 Merge branch 'master' of github.com:wummel/linkchecker 2013-12-04 19:21:45 +01:00
Bastian Kleineidam
023da7c993 Remove the duplicate URL content check. 2013-12-04 19:12:40 +01:00
Bastian Kleineidam
36badddfac Update cookie code from Python module. 2013-12-04 19:05:08 +01:00
wummel
ab54809d95 Merge pull request #426 from alperkokmen/fix-lastmod-format
Fix ISO formatting for modified datetime.
2013-12-03 12:22:27 -08:00
Bastian Kleineidam
c676a4c829 Avoid DoS in SSL certificate host matching. 2013-11-30 22:07:23 +01:00
Alper Kokmen
4b3e78cac0 Fix ISO formatting for modified datetime.
This change will make sure that format_modified returns datetime value
in ISO 8601 format. See W3C documentation at
http://www.w3.org/TR/NOTE-datetime.

Since ```modified``` is parsed and then converted to UTC after it's
extracted from HTTP response, it's safe to assume that format_modified
will always format UTC datetime values.

Instead of ```isoformat``` method which omits timezone information for
UTC values, ```strftime``` with a specific format (that ends with Z)
will be used.
2013-09-02 15:38:54 -07:00
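The difference between the two formatting calls mentioned above can be shown with a short sketch. The function name format_modified comes from the commit message, but the body and the exact format string are assumptions; the input is a naive datetime that is taken to be in UTC already.

```python
from datetime import datetime


def format_modified(modified):
    """Format a naive datetime, assumed to be UTC, as ISO 8601 with a Z suffix."""
    return modified.strftime("%Y-%m-%dT%H:%M:%SZ")


modified = datetime(2013, 9, 2, 22, 38, 54)

without_zone = modified.isoformat()    # "2013-09-02T22:38:54", no zone designator
with_zone = format_modified(modified)  # "2013-09-02T22:38:54Z"
```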
Charles Jones
4294633c04 Close connection prior to falling back to get, since we change the url back to the original at that time. 2013-08-09 13:08:51 -05:00
Charles Jones
8bc138f18b Reset to original url when falling back to GET 2013-07-30 13:38:59 -05:00
Bastian Kleineidam
c966fe6b24 Remove the http-wrong-redirect warning 2013-04-11 18:33:19 +02:00
Bastian Kleineidam
134db22830 Updated homepage URL. 2013-04-09 20:11:04 +02:00
Bastian Kleineidam
21678c661d Updated gzip and httplib copies. 2013-03-11 20:21:58 +01:00
Bastian Kleineidam
6b05f1d290 Paginate help output again. 2013-02-28 21:21:00 +01:00
Bastian Kleineidam
123578a4cd Make per-host connection limits configurable. 2013-02-27 19:37:28 +01:00
Bastian Kleineidam
b7c82d1e75 Fix strformat.strsize() test. 2013-02-27 19:36:03 +01:00
Bastian Kleineidam
b38317d57b Replace optparse with argparse. 2013-02-27 19:35:44 +01:00
Bastian Kleineidam
64d95e45e0 Remove local HTML and CSS syntax check. 2013-02-08 21:36:02 +01:00
Bastian Kleineidam
b104482174 Add missing docstring. 2013-01-25 21:15:12 +01:00
Bastian Kleineidam
35bc79dd90 Updated copyright. 2013-01-25 21:14:27 +01:00
Bastian Kleineidam
707b7b7db1 Close HTTP connections without body content. Github issue #376 2013-01-23 19:42:29 +01:00
Bastian Kleineidam
e6ad32c028 Catch UnicodeError for invalid host names. 2013-01-23 19:42:29 +01:00
Bastian Kleineidam
c0a0efbd1d Do not handle non-existing SIGUSR1 signal. 2013-01-22 21:23:46 +01:00
Bastian Kleineidam
47451d7def Fix GUI drag and drop. 2013-01-22 19:06:10 +01:00
Bastian Kleineidam
faa743e876 Increase per-host connection limits. 2013-01-22 18:18:48 +01:00
Bastian Kleineidam
fa402c0d70 Allow drag-and-drop of all local files. 2013-01-22 18:17:07 +01:00
Bastian Kleineidam
7134c0bb05 Print thread stack traces on SIGUSR1 2013-01-22 18:16:53 +01:00
Bastian Kleineidam
9b8cb67d78 Updated copyright. 2013-01-17 20:41:47 +01:00
Bastian Kleineidam
4dad2aa33c Support dns-prefetch URLs. 2013-01-17 20:41:09 +01:00
Bastian Kleineidam
7fe72745ae Updated copyright. 2013-01-09 23:03:12 +01:00
Bastian Kleineidam
fe7e9a5c6c Improve Word document opening: open read-only and invisible, avoiding unnecessary dialogs. 2013-01-07 22:18:39 +01:00
Bastian Kleineidam
a5b6136e70 Check word document validity before closing. 2013-01-07 21:58:02 +01:00
Bastian Kleineidam
0e50834f9a Rename external module to exclude it from some style checks. 2013-01-06 18:17:29 +01:00
Bastian Kleineidam
65a0031c10 Updated copyright. 2013-01-06 18:12:44 +01:00
Bastian Kleineidam
16b84be490 Updated all links. 2013-01-06 18:10:13 +01:00
Bastian Kleineidam
0283362ce6 Updated copyright. 2012-12-23 21:32:16 +01:00
Bastian Kleineidam
a7b83e6200 Fix GUI startup for Windows. 2012-12-19 21:12:02 +01:00
Bastian Kleineidam
9820530313 Use better_exchook to print more internal error info. 2012-12-18 23:06:48 +01:00
Bastian Kleineidam
f568a04a7c Fix ignore option storing in GUI. 2012-12-13 17:06:06 +01:00
Bastian Kleineidam
27df4e20da Add error handling for screen console function. 2012-12-07 22:31:48 +01:00
Bastian Kleineidam
efbbb656a1 Remove python-dns conflict by moving the dns module into a custom subdirectory. 2012-12-07 22:19:32 +01:00
Bastian Kleineidam
45a4bbdaa9 Use locale.format() and os.path.getsize() 2012-12-01 00:05:14 +01:00
Bastian Kleineidam
42a17cbb98 Prepare py3 port and display sys.argv on internal errors. 2012-11-26 18:49:07 +01:00
Bastian Kleineidam
ec03d56b62 Remove pysqlite dependency. 2012-11-14 20:23:56 +01:00
Bastian Kleineidam
7ae1eadadb Improve http status 305 code message. 2012-11-13 18:13:36 +01:00
Bastian Kleineidam
cd4abb1f12 Improve repr() of url data, and remove alexa test script. 2012-11-09 19:09:38 +01:00
Bastian Kleineidam
f3e52f1176 loginpasswordfield is not a password 2012-11-08 22:11:35 +01:00
Bastian Kleineidam
e5735e2a5d Fix URL queue handling. 2012-11-08 12:48:21 +01:00
Bastian Kleineidam
96c6a7f378 Display portable flag in about dialog. 2012-11-08 11:59:20 +01:00
Bastian Kleineidam
bc683577de Remove URLs from the in_progress cache. 2012-11-08 11:03:16 +01:00
Bastian Kleineidam
810a62e093 Fix file url checking. 2012-11-07 19:37:16 +01:00
Bastian Kleineidam
2d6cfb238f Add trailing dot when creating user configuration directory on Windows. 2012-11-07 18:22:07 +01:00
Bastian Kleineidam
b0c2a90b94 Updated copyright. 2012-11-07 18:08:44 +01:00
Bastian Kleineidam
f9a7f5ef96 Restrict local file checking. 2012-11-07 18:07:00 +01:00
Bastian Kleineidam
02ec94dbfb Improve cancel message. 2012-11-06 21:54:09 +01:00
Bastian Kleineidam
eabaa41bd2 Do not check duplicate URLs. 2012-11-06 21:34:22 +01:00
Bastian Kleineidam
ae5f9e8801 Print active threads in debug level. 2012-11-06 21:33:43 +01:00
Bastian Kleineidam
9745be9d71 Fix cookie path matching with empty paths. 2012-10-30 17:44:00 +01:00
Bastian Kleineidam
e2fd37b886 Encode user and password for telnet connection. 2012-10-30 17:44:00 +01:00
Bastian Kleineidam
c6d8b0050e Improve PHP command check. 2012-10-29 21:05:26 +01:00
Bastian Kleineidam
e8da486d66 Detect redirection errors when getting content. 2012-10-26 18:05:00 +02:00
Bastian Kleineidam
2390827735 Debug cookies. 2012-10-25 17:53:16 +02:00
Bastian Kleineidam
c44aa2db1f Fix anchor checking of cached HTTP URLs by using the cached content type. 2012-10-25 06:37:10 +02:00
Bastian Kleineidam
dca52145d3 Misc stuff. 2012-10-24 22:59:28 +02:00
Bastian Kleineidam
b39158e65c Improve available anchor message. 2012-10-24 22:21:46 +02:00
Bastian Kleineidam
dd2c963fac Fix non-ASCII exception handling. 2012-10-24 22:14:45 +02:00
Bastian Kleineidam
64de760b97 Added debug statements for unparseable content types. 2012-10-24 22:06:42 +02:00
Bastian Kleineidam
3a51ac7662 Warn about accessible passwords in config files. 2012-10-15 14:36:10 +02:00
Bastian Kleineidam
8750d55a73 Add configuration entry for maximum number of URLs. 2012-10-14 11:13:55 +02:00
Bastian Kleineidam
2ebedbaaa6 Fix content reading. 2012-10-13 16:48:29 +02:00
Bastian Kleineidam
0e4e694ad1 Fix connection handling on redirects. 2012-10-13 13:36:43 +02:00
Bastian Kleineidam
3b5877161c Improved debugging. 2012-10-13 13:36:28 +02:00
Bastian Kleineidam
d3b44be2c4 Improved documentation. 2012-10-13 12:03:19 +02:00
Bastian Kleineidam
7929a48d78 Fix url split with invalid port names. 2012-10-13 12:03:09 +02:00
Bastian Kleineidam
aa057bd36f Fix colorama init error. 2012-10-12 20:39:34 +02:00
Bastian Kleineidam
6a204120b6 Handle stale file system links for local file checks. 2012-10-12 17:20:19 +02:00
Bastian Kleineidam
c4e15c7b88 Improved duplication url check. 2012-10-10 21:04:48 +02:00
Bastian Kleineidam
b758fc6f52 Reuse existing response. 2012-10-10 12:27:36 +02:00
Bastian Kleineidam
a0610310b4 Print debug on stderr. 2012-10-10 12:27:25 +02:00
Bastian Kleineidam
0c20ef5de4 Strip console characters only from line text. 2012-10-10 12:27:08 +02:00
Bastian Kleineidam
e1e80b7dd5 Remove addrinfo cache. 2012-10-10 10:54:58 +02:00
Bastian Kleineidam
20be0f2519 Strip control chars from logger output. 2012-10-10 10:54:30 +02:00
Bastian Kleineidam
f484a6776d Use timeout value from configuration. 2012-10-10 10:53:52 +02:00
Bastian Kleineidam
871508ef5d Add docs and updated copyright. 2012-10-10 06:53:16 +02:00
Bastian Kleineidam
63cf8adf54 Catch ValueError on invalid cookie expiration dates. 2012-10-10 06:44:38 +02:00
Bastian Kleineidam
06a25676c5 Only read the maximum data size plus one, not the whole file. 2012-10-10 06:35:33 +02:00
Bastian Kleineidam
3e1d51b8bf Use RLock to simplify internal locking. 2012-10-09 21:11:35 +02:00
Bastian Kleineidam
c4cd66ea1b Simplify decorator duration check logic. 2012-10-09 21:05:24 +02:00
Bastian Kleineidam
03a5d476b3 Use URL name if title is empty. 2012-10-09 21:04:54 +02:00
Bastian Kleineidam
6d47b76509 Limit HTTP and FTP connections. Gets rid of spurious BadStatusLine errors. 2012-10-09 21:04:20 +02:00
Bastian Kleineidam
7d3ece502c Support semaphores. 2012-10-09 19:46:06 +02:00
Bastian Kleineidam
ad8525c483 Improve BadStatusLine error message. 2012-10-05 08:32:24 +02:00
Bastian Kleineidam
d15fafb1f7 Code cleanup. 2012-10-05 08:10:44 +02:00
Bastian Kleineidam
5ebd754cdb Improved duplicate url check. 2012-10-01 16:11:45 +02:00
Bastian Kleineidam
ed7c60e491 Do not warn about duplicate URLs which can point to the same content. 2012-10-01 13:42:46 +02:00
Bastian Kleineidam
148846be67 Add flag to log lock contentions. 2012-10-01 13:32:30 +02:00
Bastian Kleineidam
b56c054932 Use finer-grained robots.txt locks to improve lock contention. 2012-10-01 13:29:29 +02:00
Bastian Kleineidam
27b61c3bfa Fix gzip handling in http content decoder. 2012-09-30 14:00:49 +02:00
Bastian Kleineidam
cbc3bcb0d3 Sitemap logger fixes. 2012-09-23 23:20:21 +02:00
Bastian Kleineidam
60305d8877 Code cleanup. 2012-09-23 21:20:12 +02:00
Bastian Kleineidam
e21187b275 Put in-progress URLs back near the front of URL queue, not at end. 2012-09-23 21:00:01 +02:00
Bastian Kleineidam
1f3034b5f5 Sitemap logger fixes. 2012-09-23 20:59:38 +02:00
Bastian Kleineidam
38dd63f055 Code cleanup. 2012-09-23 16:19:42 +02:00
Bastian Kleineidam
7f8fd01b22 Add Accept-Encoding and Accept-Charset headers. 2012-09-23 15:06:44 +02:00
Bastian Kleineidam
03ecff22bb Fix endless loop in http authentication. 2012-09-22 22:21:10 +02:00
Bastian Kleineidam
653b5f27dd Updated ignored schemes. 2012-09-22 16:18:37 +02:00
Bastian Kleineidam
1c59cb4d4c Use GET in case a HEAD method does not succeed, even if robots.txt content checks denied the page. This way proper check results are achieved (but the content is still not checked, so it's ok). 2012-09-22 07:53:11 +02:00
Bastian Kleineidam
fba465e8e8 Fix robotstxt cache miss stats. 2012-09-21 21:12:28 +02:00
Bastian Kleineidam
f6b007f757 Fix useragent matching in robots.txt parser. 2012-09-21 21:12:13 +02:00
Bastian Kleineidam
bbf25106fa Fix double result setting on http checks. 2012-09-21 20:33:15 +02:00
Bastian Kleineidam
3e464e509c Do not allow empty configuration string values. 2012-09-21 16:05:34 +02:00
Bastian Kleineidam
ecf8753a19 Improved user-agent string similar to Google and Bing search bots. 2012-09-21 15:46:14 +02:00
Bastian Kleineidam
c274b50c50 Store lowercase URL scheme in checker class. 2012-09-21 14:35:25 +02:00
Bastian Kleineidam
0941f6ff02 Improve exception handling by using unicode. 2012-09-21 14:29:20 +02:00
Bastian Kleineidam
f46889a4af Log timestamps in debug output. 2012-09-21 13:05:36 +02:00
Bastian Kleineidam
049882e4fe Remove accept-encoding since some sites have wrong compression. 2012-09-20 22:39:15 +02:00
Bastian Kleineidam
7c6dce6136 Only warn non-empty site duplicates. 2012-09-20 20:39:36 +02:00
Bastian Kleineidam
a03090c20f Optimize intern/extern pattern parsing. 2012-09-20 20:19:13 +02:00
Bastian Kleineidam
c385c35b1a Fix ansicolor again. 2012-09-20 16:39:40 +02:00
Bastian Kleineidam
b9d234c78a Fix wrong method name in SSL certificate check. 2012-09-20 16:28:01 +02:00
Bastian Kleineidam
bff217c58b Never log ignored warnings. 2012-09-20 12:44:40 +02:00
Bastian Kleineidam
600b7c0e69 Fix duplicate content warning when self.size is not set yet. 2012-09-20 12:44:23 +02:00
Bastian Kleineidam
9cfee5eb5b Improved color detection with curses. 2012-09-20 12:13:15 +02:00
Bastian Kleineidam
bc0a17c1c4 Display last modified date in the GUI. 2012-09-19 21:23:39 +02:00
Bastian Kleineidam
d37347cab0 Remove unused variable. 2012-09-19 11:08:06 +02:00
Bastian Kleineidam
18a200d85f Fix tests. 2012-09-19 11:05:26 +02:00
Bastian Kleineidam
b8f8bdf5fc Fix last modified formatting. 2012-09-19 10:09:19 +02:00
Bastian Kleineidam
f5fbd7666f Remove unused import. 2012-09-19 09:39:32 +02:00
Bastian Kleineidam
75719b34f6 Updated copyright. 2012-09-19 09:17:25 +02:00
Bastian Kleineidam
71fba0f8b7 Log all valid URLs in sitemap loggers. 2012-09-19 09:17:08 +02:00
Bastian Kleineidam
9d1c90f96c Write extra script to analyse a memory dump. 2012-09-18 16:08:31 +02:00
Bastian Kleineidam
3a352631ba Add modified field to loggers. 2012-09-18 12:12:00 +02:00
Bastian Kleineidam
1db63227f6 Memoize file operations to minimize disk I/O. 2012-09-18 09:37:21 +02:00
Bastian Kleineidam
932a07a9cf Added XML sitemap logger. 2012-09-18 09:16:34 +02:00
Bastian Kleineidam
4e59056ee7 Warn about duplicate URL contents. 2012-09-17 19:49:50 +02:00
Bastian Kleineidam
02a09dbb28 Add documentation. 2012-09-17 16:30:32 +02:00
Bastian Kleineidam
99bf8aa940 Updated copyright. 2012-09-17 16:09:55 +02:00
Bastian Kleineidam
cb71f483a5 Warn about too long URLs. 2012-09-17 16:00:23 +02:00
Bastian Kleineidam
03667a4ec9 Print warning tags in text output. 2012-09-17 15:29:04 +02:00
Bastian Kleineidam
1f9ee987f9 Improved terminal color detection with curses. 2012-09-17 15:24:04 +02:00
Bastian Kleineidam
6e1841cf1f Print download and cache statistics. 2012-09-17 15:23:25 +02:00