93 commits

Author SHA1 Message Date
Bastian Kleineidam
b18854649d Count unique URLs for url queue limit. 2014-03-14 20:21:46 +01:00
Bastian Kleineidam
257644e660 Add cache length function to get number of cached elements. 2014-03-14 20:19:34 +01:00
Bastian Kleineidam
6b334dc79b Fix URL result caching. 2014-03-08 19:35:10 +01:00
Bastian Kleineidam
b17211f162 Set for release. 2014-03-04 21:36:24 +01:00
Bastian Kleineidam
82f81241fd Check all links and add better caching. 2014-03-03 23:29:45 +01:00
Bastian Kleineidam
6f205a2574 Support checking Sitemap: URLs in robots.txt files. 2014-03-01 20:25:19 +01:00
Bastian Kleineidam
0f0d79c7e0 Remove crawl-delay stuff 2014-03-01 20:01:42 +01:00
Bastian Kleineidam
7b34be590b Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements. 2014-03-01 00:12:34 +01:00
Bastian Kleineidam
c806be5c15 Updated copyright 2014-01-08 22:33:04 +01:00
Bastian Kleineidam
e0a2558b2b Updated copyright. 2013-12-24 07:13:16 +01:00
Bastian Kleineidam
0ca63797bf Remove content cache. 2013-12-10 23:41:52 +01:00
Bastian Kleineidam
36badddfac Update cookie code from Python module. 2013-12-04 19:05:08 +01:00
Bastian Kleineidam
123578a4cd Make per-host connection limits configurable. 2013-02-27 19:37:28 +01:00
Bastian Kleineidam
35bc79dd90 Updated copyright. 2013-01-25 21:14:27 +01:00
Bastian Kleineidam
faa743e876 Increase per-host connection limits. 2013-01-22 18:18:48 +01:00
Bastian Kleineidam
0283362ce6 Updated copyright. 2012-12-23 21:32:16 +01:00
Bastian Kleineidam
42a17cbb98 Prepare py3 port and display sys.argv on internal errors. 2012-11-26 18:49:07 +01:00
Bastian Kleineidam
e5735e2a5d Fix URL queue handling. 2012-11-08 12:48:21 +01:00
Bastian Kleineidam
bc683577de Remove URLs from the in_progress cache. 2012-11-08 11:03:16 +01:00
Bastian Kleineidam
eabaa41bd2 Do not check duplicate URLs. 2012-11-06 21:34:22 +01:00
Bastian Kleineidam
8750d55a73 Add configuration entry for maximum number of URLs. 2012-10-14 11:13:55 +02:00
Bastian Kleineidam
3b5877161c Improved debugging. 2012-10-13 13:36:28 +02:00
Bastian Kleineidam
e1e80b7dd5 Remove addrinfo cache. 2012-10-10 10:54:58 +02:00
Bastian Kleineidam
871508ef5d Add docs and updated copyright. 2012-10-10 06:53:16 +02:00
Bastian Kleineidam
6d47b76509 Limit HTTP and FTP connections. Gets rid of spurious BadStatusLine errors. 2012-10-09 21:04:20 +02:00
Bastian Kleineidam
b56c054932 Use finer-grained robots.txt locks to improve lock contention. 2012-10-01 13:29:29 +02:00
Bastian Kleineidam
60305d8877 Code cleanup. 2012-09-23 21:20:12 +02:00
Bastian Kleineidam
e21187b275 Put in-progress URLs back near the front of URL queue, not at end. 2012-09-23 21:00:01 +02:00
Bastian Kleineidam
fba465e8e8 Fix robotstxt cache miss stats. 2012-09-21 21:12:28 +02:00
Bastian Kleineidam
4e59056ee7 Warn about duplicate URL contents. 2012-09-17 19:49:50 +02:00
Bastian Kleineidam
02a09dbb28 Add documentation. 2012-09-17 16:30:32 +02:00
Bastian Kleineidam
99bf8aa940 Updated copyright. 2012-09-17 16:09:55 +02:00
Bastian Kleineidam
6e1841cf1f Print download and cache statistics. 2012-09-17 15:23:25 +02:00
Bastian Kleineidam
21db38546c Updated copyright. 2012-09-02 23:36:31 +02:00
Bastian Kleineidam
3baaca47a0 Add maximum number of allowed puts on URL queue. 2012-09-02 22:44:29 +02:00
Bastian Kleineidam
d8fce1ceeb Do not sort URL queue anymore. 2012-09-02 22:32:14 +02:00
Bastian Kleineidam
7a6436f08f Increase checked cache in URL queue. 2012-09-02 22:21:49 +02:00
Bastian Kleineidam
9956f3712e Properly detect too-long Unicode hostnames. 2011-12-05 20:51:42 +01:00
Bastian Kleineidam
6b52b28425 Send all domain-matching cookies that apply. 2011-08-03 21:21:44 +02:00
Bastian Kleineidam
48413de418 Display warning message for each cookie parsing error. 2011-08-03 19:27:36 +02:00
Bastian Kleineidam
8779158b2f Sent cookies with more specific paths first. 2011-08-02 21:56:26 +02:00
Bastian Kleineidam
977d9e9ae6 Update cookie values instead of adding duplicate entries. 2011-08-01 20:26:31 +02:00
Bastian Kleineidam
2dfe62afa2 Updated copyright. 2011-02-14 21:07:07 +01:00
Bastian Kleineidam
c5884b8d87 Add function documentation. 2011-02-14 21:06:34 +01:00
Bastian Kleineidam
0589933b97 Reuse connections more than once. 2011-02-14 20:28:38 +01:00
Bastian Kleineidam
017a1087ba Remove unneeded __future__ import 2010-11-21 10:45:30 +01:00
Bastian Kleineidam
5bb222b1df Updated copyright 2010-10-24 01:02:39 +02:00
Bastian Kleineidam
fb4689dbe1 Fix previous commit. 2010-10-13 22:40:55 +02:00
Bastian Kleineidam
415efe262e Added equality check for Cookies, and use that to augment the retrieved cookies. 2010-10-13 22:35:36 +02:00
Bastian Kleineidam
1ce1521a9f Improved debug message and cleaned up some syntax. 2010-10-13 22:29:44 +02:00