Commit graph

268 commits

Author SHA1 Message Date
Bastian Kleineidam
3a352631ba Add modified field to loggers. 2012-09-18 12:12:00 +02:00
Bastian Kleineidam
4e59056ee7 Warn about duplicate URL contents. 2012-09-17 19:49:50 +02:00
Bastian Kleineidam
6e1841cf1f Print download and cache statistics. 2012-09-17 15:23:25 +02:00
Bastian Kleineidam
273230d98b Send HTTP Do-Not-Track header. 2012-09-14 22:41:38 +02:00
Bastian Kleineidam
7a6436f08f Increase checked cache in URL queue. 2012-09-02 22:21:49 +02:00
Bastian Kleineidam
4c16d3e702 Make 401 unauthorized GET response a warning. 2012-08-26 11:32:17 +02:00
Bastian Kleineidam
ae15d51b30 Translate more result strings. 2012-08-23 23:59:33 +02:00
Bastian Kleineidam
ce4253263c Do not special case http->ftp redirects. 2012-08-23 23:56:36 +02:00
Bastian Kleineidam
7374068941 Remove unused import. 2012-08-23 16:46:14 +02:00
Bastian Kleineidam
73d64e50ab Fix redirection to new scheme. 2012-08-23 16:45:24 +02:00
Bastian Kleineidam
bc287d7710 Make unauthorized access responses with missing www-authenticate headers an error. 2012-08-23 15:52:11 +02:00
Bastian Kleineidam
e252bbf623 Remove Amazon quirk because the default behaviour handles this now. 2012-08-23 05:36:51 +02:00
Bastian Kleineidam
ecef16b2c9 Support WML sites. 2012-08-22 22:43:14 +02:00
Bastian Kleineidam
6915e2f989 Detect sites not supporting HEAD requests. 2012-08-14 18:43:39 +02:00
Bastian Kleineidam
f3b66b102d Fallback to GET when method HEAD is not allowed. 2012-08-13 07:07:21 +02:00
Bastian Kleineidam
6be3e9ddff Cleanup code and improve redirect anchor handling. 2012-08-12 11:14:56 +02:00
Bastian Kleineidam
5c045fef44 Fix UNC path handling on Windows. 2012-06-24 10:30:54 +02:00
Bastian Kleineidam
cbb13a8983 Add SSL certificate verification. 2012-06-18 23:05:44 +02:00
Bastian Kleineidam
f107092a8a Fix handling of user/password info in URLs. 2012-06-10 22:07:42 +02:00
Bastian Kleineidam
2dee223555 Allow memory dumps to be written. 2012-06-10 13:18:35 +02:00
Bastian Kleineidam
54ffb102d8 Code cleanup: add function for GET fallback. 2012-06-10 09:52:12 +02:00
Bastian Kleineidam
5c94c47901 Remove old Squid proxy workaround. 2012-06-10 09:45:07 +02:00
Bastian Kleineidam
bcbacec79a Code cleanup. 2012-05-10 21:05:33 +02:00
Bastian Kleineidam
61138744e6 Always use GET for Zope servers. 2012-05-08 20:47:47 +02:00
Bastian Kleineidam
797024c69b Fix URL connection cache key. 2012-04-04 22:58:09 +02:00
Bastian Kleineidam
4feea986b4 Fix concatenation of multiple cookie values. 2012-03-31 08:51:58 +02:00
Bastian Kleineidam
da6d7b0eca Store cookies on redirect. 2012-03-31 08:37:18 +02:00
Bastian Kleineidam
b9b8e3f5b2 Honor the charset encoding of the Content-Type HTTP header when parsing HTML. 2012-03-22 22:45:11 +01:00
Bastian Kleineidam
5e13a78f66 Fix non-ascii HTTP header debugging. 2012-03-09 11:54:18 +01:00
Bastian Kleineidam
3fcff8a4e5 Fix non-ascii HTTP header handling. 2012-03-09 11:14:18 +01:00
Bastian Kleineidam
24811ac7b0 Recheck extern status on HTTP redirects even if domain did not change. 2012-03-08 10:07:31 +01:00
Bastian Kleineidam
71f5ee42c8 Updated copyright. 2012-01-29 17:18:28 +01:00
Bastian Kleineidam
6e1e9148d8 Work around a Squid bug resulting in not detecting broken links 2012-01-17 08:36:11 +01:00
Bastian Kleineidam
4c15fc6a8b Properly handle non-ASCII HTTP header values. 2012-01-14 11:01:09 +01:00
Bastian Kleineidam
cdf91a0321 Improve cookie info message and fix cookie test cases. 2011-08-04 18:34:56 +02:00
Bastian Kleineidam
48413de418 Display warning message for each cookie parsing error. 2011-08-03 19:27:36 +02:00
Bastian Kleineidam
c99b75899d Send multiple cookie values in one header. 2011-08-02 21:57:16 +02:00
Bastian Kleineidam
c70bd68ef1 Refactor sending of cookie data in client into separate function. 2011-08-02 20:45:26 +02:00
Bastian Kleineidam
51bcccfdfe Added new option --user-agent to set the User-Agent header. 2011-07-25 21:09:49 +02:00
Bastian Kleineidam
552c71a3ca Do not append a stray newline character when encoding authentication information to base64. 2011-07-25 20:02:01 +02:00
Bastian Kleineidam
5515645af6 Reset content type setting after loading HTTP headers. 2011-05-28 17:59:44 +02:00
Bastian Kleineidam
03feaeca91 Correct warning about unparsable cookies. 2011-05-18 20:56:31 +02:00
Bastian Kleineidam
10bbb696e8 Limit download file size to 5MB. 2011-05-05 21:10:55 +02:00
Bastian Kleineidam
1f9cd2f67f Redirection refactoring part 2 of 2. 2011-04-27 13:33:01 +02:00
Bastian Kleineidam
dd53c78096 Redirection refactoring part 1. 2011-04-27 12:02:30 +02:00
Bastian Kleineidam
f566f98fe5 Allow redirections for URLs given by the user. 2011-04-27 11:21:58 +02:00
Bastian Kleineidam
6a544f2d69 Only allow redirections to FTP, HTTP and HTTPS URLs. 2011-04-19 07:01:55 +02:00
Bastian Kleineidam
de5d1757f0 Add workaround for buggy IIS HEAD support. 2011-02-24 11:12:59 +01:00
Bastian Kleineidam
2dfe62afa2 Updated copyright. 2011-02-14 21:07:07 +01:00
Bastian Kleineidam
c5884b8d87 Add function documentation. 2011-02-14 21:06:34 +01:00
Bastian Kleineidam
fd3fe8dcaa Fix missing content types for cached URLs. 2010-12-23 07:37:36 +01:00
Bastian Kleineidam
7c55351511 Add get_content_type methods to subclasses. 2010-12-15 07:54:44 +01:00
Bastian Kleineidam
01184784ef Remove warning about Unicode domains which are more widely supported now. 2010-12-11 07:58:15 +01:00
Bastian Kleineidam
6fac69cddb Fall back to GET when connection is reset. 2010-11-21 19:50:51 +01:00
Bastian Kleineidam
147bf31e1e Check for allowed HTTP GET method before parsing anchors in HTML file contents. 2010-11-17 19:13:26 +01:00
Bastian Kleineidam
4f5c957e43 Fix check of external domain after HTTP redirect. 2010-11-06 18:00:49 +01:00
Bastian Kleineidam
23b20306e9 Remove duplicate HTTP response codes. 2010-11-01 09:27:53 +01:00
Bastian Kleineidam
c5f93a561d Fix debug message formatting. 2010-11-01 05:59:04 +01:00
Bastian Kleineidam
f14340a0a8 Do not check content of already cached URLs. 2010-10-27 19:52:48 +02:00
Bastian Kleineidam
1f81124dfa Fix typo. 2010-10-27 19:23:14 +02:00
Bastian Kleineidam
23403f09bb Do not print warning for HTTP to HTTPS or HTTPS to HTTP redirects. 2010-10-27 14:44:05 +02:00
Bastian Kleineidam
b2cf40151f Improved redirection warning text. 2010-10-27 09:15:46 +02:00
Bastian Kleineidam
d9e981e497 Don't log a warning if commandline URL has been redirected. 2010-10-26 16:24:27 +02:00
Bastian Kleineidam
4375d35328 Add warning about unsupported HTTP authentication, and revert the realm changes. 2010-10-25 22:41:31 +02:00
Bastian Kleineidam
2a7292845c Improved info message about sent cookies; do not report the retrieved cookie information. 2010-10-13 22:32:50 +02:00
Bastian Kleineidam
a8aa3bdb00 Another fix to ensure get_content() is only called when allowed. 2010-10-13 22:14:43 +02:00
Bastian Kleineidam
61e611e4bf Prevent unallowed content read when checking for robots.txt allowance in HTML files. 2010-10-12 00:40:34 +02:00
Bastian Kleineidam
e494d6bbb6 Move MIME type detection into fileutil.py module, and use mimetools for detection. 2010-10-03 08:47:48 +02:00
Bastian Kleineidam
e0f4097eb0 Ensure HttpUrl.set_title_from_content() is only called when the content is allowed to be retrieved. 2010-09-29 19:26:03 +02:00
Bastian Kleineidam
5284017d67 Only fallback to HTTP GET when robots.txt allows it. 2010-09-04 18:09:59 +02:00
Bastian Kleineidam
60f7af4598 Allow redirections to external URLs with same domain. 2010-08-13 01:22:18 +02:00
Bastian Kleineidam
1faedafb33 Fix data size for HTTP requests. 2010-08-04 00:06:25 +02:00
Bastian Kleineidam
7ad4f7c220 Compare size from meta info and content data. 2010-07-29 19:53:41 +02:00
Bastian Kleineidam
7536472797 Send correct host header when using http proxy. 2010-07-29 06:50:35 +02:00
Bastian Kleineidam
3370ea1562 Reflect changes in httplib2.py: use buffered read in httplib response object and use bad status line exception attribute. 2010-03-26 20:50:38 +01:00
Bastian Kleineidam
b8b0398dd2 Ensure redirected URL is Unicode encoded. 2010-03-07 22:11:55 +01:00
Bastian Kleineidam
c8e6995ecd Support HTTPS proxies. 2010-03-07 21:06:10 +01:00
Bastian Kleineidam
6a2fcf8ae9 Parse links in Word files. 2010-03-07 19:20:51 +01:00
Bastian Kleineidam
3d5c114f14 Warn on permanent redirections even when URL is outside of domain filter. 2010-03-07 09:36:21 +01:00
Bastian Kleineidam
2d73b907f1 Retry HTTP when server sent empty status line; should fix most of the BadStatusLine errors that are sporadically encountered. 2010-03-06 10:23:34 +01:00
Bastian Kleineidam
5e06b6b8d4 Updated FSF address in GPL blurb 2009-07-24 23:58:20 +02:00
Bastian Kleineidam
7f67027abf ignore the fragment part (i.e. the anchor) of URIs when getting and caching content 2009-06-26 07:22:36 +02:00
Bastian Kleineidam
897b68ae9b Fix copying of httpurl info 2009-03-07 00:17:17 +01:00
Bastian Kleineidam
29adfe92fd Minor syntax fix 2009-03-06 20:14:50 +01:00
Bastian Kleineidam
6024f2e43e Add missing reset of self.reused_connection flag 2009-03-06 20:10:03 +01:00
Bastian Kleineidam
58925b21d3 Improved persistent connection handling by retrying closed connections. 2009-03-06 08:15:34 +01:00
Bastian Kleineidam
29599e4c74 Make sure persistent connection will not close after reading contents. 2009-03-05 19:15:44 +01:00
Bastian Kleineidam
bf9ed8c659 Make sure file descriptors are closed after decoding HTTP content. 2009-03-05 19:15:03 +01:00
Bastian Kleineidam
7862147ca3 Fix showing content size. 2009-03-01 23:04:48 +01:00
calvin
e9805dbd8a Updated copyright year to 2009
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3887 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2009-01-08 14:18:03 +00:00
calvin
209d5abc18 fix timeouts by testing earlier for persistent connections with HEAD
HEAD requests never have a body; nevertheless the http lib tries to
read() from them. This times out on some servers of course. Fix is
not to let those connections be persistent.

git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3871 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-11-29 08:14:28 +00:00
calvin
c20e706761 Made some format changes on translated strings.
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3870 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-11-28 20:22:48 +00:00
calvin
c3b6fc5aa4 Readd
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3867 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-11-20 21:30:10 +00:00
calvin
97cf700e04 Fixed wrong cookie debugging format line.
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3849 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-07-13 12:51:56 +00:00
calvin
b30fb3b09c Remove duplicate code in http checker.
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3820 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-06-16 19:52:09 +00:00
calvin
caf8ba6297 Really allow parsing of XHTML files; I forgot some places to adjust the MIME checking.
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3818 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-06-16 13:03:48 +00:00
calvin
a6deeeb8a5 Support parsing of HTML pages served with content type application/xhtml+xml
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3817 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-06-16 09:39:49 +00:00
calvin
a880939c40 Initialize variables in reset(), not in subsequent methods
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3796 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-06-08 09:27:13 +00:00
calvin
5f4d61e018 Use keyword arguments in translation strings.
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3780 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-05-27 19:44:40 +00:00
calvin
bacb59597e Use relative imports from Python 2.5
git-svn-id: https://linkchecker.svn.sourceforge.net/svnroot/linkchecker/trunk/linkchecker@3750 e7d03fd6-7b0d-0410-9947-9c21f3af8025
2008-05-09 06:16:03 +00:00