2625 commits

Author SHA1 Message Date
Bastian Kleineidam
6d47b76509 Limit HTTP and FTP connections. Gets rid of spurious BadStatusLine errors. 2012-10-09 21:04:20 +02:00
Bastian Kleineidam
7d3ece502c Support semaphores. 2012-10-09 19:46:06 +02:00
Bastian Kleineidam
ad8525c483 Improve BadStatusLine error message. 2012-10-05 08:32:24 +02:00
Bastian Kleineidam
d15fafb1f7 Code cleanup. 2012-10-05 08:10:44 +02:00
Bastian Kleineidam
5ebd754cdb Improved duplicate url check. 2012-10-01 16:11:45 +02:00
Bastian Kleineidam
ed7c60e491 Do not warn about duplicate URLs which can point to the same content. 2012-10-01 13:42:46 +02:00
Bastian Kleineidam
148846be67 Add flag to log lock contentions. 2012-10-01 13:32:30 +02:00
Bastian Kleineidam
b56c054932 Use finer-grained robots.txt locks to improve lock contention. 2012-10-01 13:29:29 +02:00
Bastian Kleineidam
27b61c3bfa Fix gzip handling in http content decoder. 2012-09-30 14:00:49 +02:00
Bastian Kleineidam
cbc3bcb0d3 Sitemap logger fixes. 2012-09-23 23:20:21 +02:00
Bastian Kleineidam
60305d8877 Code cleanup. 2012-09-23 21:20:12 +02:00
Bastian Kleineidam
e21187b275 Put in-progress URLs back near the front of URL queue, not at end. 2012-09-23 21:00:01 +02:00
Bastian Kleineidam
1f3034b5f5 Sitemap logger fixes. 2012-09-23 20:59:38 +02:00
Bastian Kleineidam
38dd63f055 Code cleanup. 2012-09-23 16:19:42 +02:00
Bastian Kleineidam
7f8fd01b22 Add Accept-Encoding and Accept-Charset headers. 2012-09-23 15:06:44 +02:00
Bastian Kleineidam
03ecff22bb Fix endless loop in http authentication. 2012-09-22 22:21:10 +02:00
Bastian Kleineidam
653b5f27dd Updated ignored schemes. 2012-09-22 16:18:37 +02:00
Bastian Kleineidam
1c59cb4d4c Use GET in case a HEAD method does not succeed, even if robots.txt content checks denied the page. This way proper check results are achieved (but the content is still not checked, so it's ok). 2012-09-22 07:53:11 +02:00
Bastian Kleineidam
fba465e8e8 Fix robotstxt cache miss stats. 2012-09-21 21:12:28 +02:00
Bastian Kleineidam
f6b007f757 Fix useragent matching in robots.txt parser. 2012-09-21 21:12:13 +02:00
Bastian Kleineidam
bbf25106fa Fix double result setting on http checks. 2012-09-21 20:33:15 +02:00
Bastian Kleineidam
3e464e509c Do not allow empty configuration string values. 2012-09-21 16:05:34 +02:00
Bastian Kleineidam
ecf8753a19 Improved user-agent string similar to Google and Bing search bots. 2012-09-21 15:46:14 +02:00
Bastian Kleineidam
c274b50c50 Store lowercase URL scheme in checker class. 2012-09-21 14:35:25 +02:00
Bastian Kleineidam
0941f6ff02 Improve exception handling by using unicode. 2012-09-21 14:29:20 +02:00
Bastian Kleineidam
f46889a4af Log timestamps in debug output. 2012-09-21 13:05:36 +02:00
Bastian Kleineidam
049882e4fe Remove accept-encoding since some sites have wrong compression. 2012-09-20 22:39:15 +02:00
Bastian Kleineidam
7c6dce6136 Only warn non-empty site duplicates. 2012-09-20 20:39:36 +02:00
Bastian Kleineidam
a03090c20f Optimize intern/extern pattern parsing. 2012-09-20 20:19:13 +02:00
Bastian Kleineidam
c385c35b1a Fix ansicolor again. 2012-09-20 16:39:40 +02:00
Bastian Kleineidam
b9d234c78a Fix wrong method name in SSL certificate check. 2012-09-20 16:28:01 +02:00
Bastian Kleineidam
bff217c58b Never log ignored warnings. 2012-09-20 12:44:40 +02:00
Bastian Kleineidam
600b7c0e69 Fix duplicate content warning when self.size is not set yet. 2012-09-20 12:44:23 +02:00
Bastian Kleineidam
9cfee5eb5b Improved color detection with curses. 2012-09-20 12:13:15 +02:00
Bastian Kleineidam
bc0a17c1c4 Display last modified date in the GUI. 2012-09-19 21:23:39 +02:00
Bastian Kleineidam
d37347cab0 Remove unused variable. 2012-09-19 11:08:06 +02:00
Bastian Kleineidam
18a200d85f Fix tests. 2012-09-19 11:05:26 +02:00
Bastian Kleineidam
b8f8bdf5fc Fix last modified formatting. 2012-09-19 10:09:19 +02:00
Bastian Kleineidam
f5fbd7666f Remove unused import. 2012-09-19 09:39:32 +02:00
Bastian Kleineidam
75719b34f6 Updated copyright. 2012-09-19 09:17:25 +02:00
Bastian Kleineidam
71fba0f8b7 Log all valid URLs in sitemap loggers. 2012-09-19 09:17:08 +02:00
Bastian Kleineidam
9d1c90f96c Write extra script to analyse a memory dump. 2012-09-18 16:08:31 +02:00
Bastian Kleineidam
3a352631ba Add modified field to loggers. 2012-09-18 12:12:00 +02:00
Bastian Kleineidam
1db63227f6 Memoize file operations to minimize disk I/O. 2012-09-18 09:37:21 +02:00
Bastian Kleineidam
932a07a9cf Added XML sitemap logger. 2012-09-18 09:16:34 +02:00
Bastian Kleineidam
4e59056ee7 Warn about duplicate URL contents. 2012-09-17 19:49:50 +02:00
Bastian Kleineidam
02a09dbb28 Add documentation. 2012-09-17 16:30:32 +02:00
Bastian Kleineidam
99bf8aa940 Updated copyright. 2012-09-17 16:09:55 +02:00
Bastian Kleineidam
cb71f483a5 Warn about too long URLs. 2012-09-17 16:00:23 +02:00
Bastian Kleineidam
03667a4ec9 Print warning tags in text output. 2012-09-17 15:29:04 +02:00
Bastian Kleineidam
1f9ee987f9 Improved terminal color detection with curses. 2012-09-17 15:24:04 +02:00
Bastian Kleineidam
6e1841cf1f Print download and cache statistics. 2012-09-17 15:23:25 +02:00
Bastian Kleineidam
0b5b6ab37b Automatically set --complete for graph output. 2012-09-15 15:06:29 +02:00
Bastian Kleineidam
273230d98b Send HTTP Do-Not-Track header. 2012-09-14 22:41:38 +02:00
Bastian Kleineidam
e98f15933f Stop checking if all output loggers have been deactivated. 2012-09-14 22:36:59 +02:00
Bastian Kleineidam
81d2c4dbd9 Improved documentation. 2012-09-14 22:26:45 +02:00
Bastian Kleineidam
86f1c74006 Close loggers properly on I/O errors. 2012-09-14 22:09:18 +02:00
Bastian Kleineidam
6730fb51ee Allow maximum check time specification. 2012-09-03 20:17:49 +02:00
Bastian Kleineidam
a1dfaf2f91 Add missing docstring. 2012-09-02 23:37:43 +02:00
Bastian Kleineidam
21db38546c Updated copyright. 2012-09-02 23:36:31 +02:00
Bastian Kleineidam
3baaca47a0 Add maximum number of allowed puts on URL queue. 2012-09-02 22:44:29 +02:00
Bastian Kleineidam
d8fce1ceeb Do not sort URL queue anymore. 2012-09-02 22:32:14 +02:00
Bastian Kleineidam
7a6436f08f Increase checked cache in URL queue. 2012-09-02 22:21:49 +02:00
Bastian Kleineidam
4c16d3e702 Make 401 unauthorized GET response a warning. 2012-08-26 11:32:17 +02:00
Bastian Kleineidam
b6d45eabe5 Code cleanup. 2012-08-24 09:46:38 +02:00
Bastian Kleineidam
ac6591a009 Recognize WML files on Windows. 2012-08-24 09:46:26 +02:00
Bastian Kleineidam
7334a9863e Make URL properties in GUI selectable with the mouse. 2012-08-24 00:10:59 +02:00
Bastian Kleineidam
ae15d51b30 Translate more result strings. 2012-08-23 23:59:33 +02:00
Bastian Kleineidam
ce4253263c Do not special case http->ftp redirects. 2012-08-23 23:56:36 +02:00
Bastian Kleineidam
7374068941 Remove unused import. 2012-08-23 16:46:14 +02:00
Bastian Kleineidam
73d64e50ab Fix redirection to new scheme. 2012-08-23 16:45:24 +02:00
Bastian Kleineidam
99ab68908c Increase the default number of checker threads. 2012-08-23 16:11:47 +02:00
Bastian Kleineidam
bc287d7710 Make unauthorized access responses with missing www-authenticate headers an error. 2012-08-23 15:52:11 +02:00
Bastian Kleineidam
e252bbf623 Remove Amazon quirk because the default behaviour handles this now. 2012-08-23 05:36:51 +02:00
Bastian Kleineidam
02a9f0bacb Add utility method to read string options. 2012-08-23 04:52:25 +02:00
Bastian Kleineidam
ecef16b2c9 Support WML sites. 2012-08-22 22:43:14 +02:00
Bastian Kleineidam
36b1bb01e0 Fix variable name typo. 2012-08-22 22:00:11 +02:00
Bastian Kleineidam
8d36bf4e3d Show URLs in status bar. 2012-08-14 23:00:50 +02:00
Bastian Kleineidam
76f57dc4ad Updated copyright. 2012-08-14 20:37:24 +02:00
Bastian Kleineidam
6915e2f989 Detect sites not supporting HEAD requests. 2012-08-14 18:43:39 +02:00
Bastian Kleineidam
db76f01d48 Stop application when aborting timed out. Only used on the command line. 2012-08-14 17:41:26 +02:00
Bastian Kleineidam
29a5c1a44a Display the real url name in gui property field. 2012-08-13 18:55:25 +02:00
Bastian Kleineidam
f3b66b102d Fallback to GET when method HEAD is not allowed. 2012-08-13 07:07:21 +02:00
Bastian Kleineidam
e65b5c72ce Correct list of schemes requiring host name. 2012-08-12 14:21:56 +02:00
Bastian Kleineidam
7b567cc378 Make scheme and domain for internal url pattern case insensitive. 2012-08-12 14:19:42 +02:00
Bastian Kleineidam
afc0ecd7a6 --ignore-url now really ignores URLs. 2012-08-12 11:16:29 +02:00
Bastian Kleineidam
b86be09d9e Recalculate extern settings after changing intern patterns. 2012-08-12 11:15:18 +02:00
Bastian Kleineidam
6be3e9ddff Cleanup code and improve redirect anchor handling. 2012-08-12 11:14:56 +02:00
Bastian Kleineidam
10cc59c654 Use colorama only on Windows systems. 2012-08-12 10:23:44 +02:00
Bastian Kleineidam
cf53b33c94 Remove unused functions. 2012-08-11 19:34:27 +02:00
Bastian Kleineidam
aa22dc2702 Fix windows console output. 2012-08-11 07:52:04 +02:00
Bastian Kleineidam
d9acc97f9f Use colorama instead of wconio. 2012-08-10 22:24:00 +02:00
Bastian Kleineidam
c74690a79a Do not check SSL certificates on HTTPS -> HTTP redirects. 2012-08-10 19:43:57 +02:00
Bastian Kleineidam
451a520943 Prevent double color stream proxying. 2012-08-10 19:43:33 +02:00
Bastian Kleineidam
580ab74f0e Updated German translation. 2012-08-09 20:43:31 +02:00
Bastian Kleineidam
82b4dea4fe Updated copyright 2012-08-09 20:43:22 +02:00
Bastian Kleineidam
1c739aed81 Use urlparse.uses_relative instead of unofficial urlparse.non_hierarchical (which has been removed in the current CPython 2.7.x trunk). 2012-08-04 20:40:31 +02:00
Bastian Kleineidam
b0e5c7fc59 Ignore feed: URLs. 2012-06-27 21:32:03 +02:00
Bastian Kleineidam
0fd1a78378 Always compare encoded anchor names. 2012-06-27 20:59:53 +02:00
Bastian Kleineidam
e0d6aecad9 Add cancel button to show memory dialog. 2012-06-25 20:25:02 +02:00