Marius Gedminas | 4a092c218c | Whitespace bigotry | 2017-03-14 17:18:27 +02:00
Petr Dlouhý | eaa538c814 | don't check one url multiple times | 2017-02-14 10:23:25 +01:00
Nicolas Bigaouette | 4e56eceb35 | Detect if "url_data" contains proxy attributes before using them. Fix proposed by @colwilson in issue #555. | 2014-11-12 09:58:30 -05:00
Bastian Kleineidam | 35eb30432e | Added some Python3 fixes. | 2014-09-12 19:36:30 +02:00
Bastian Kleineidam | 06c6b80ed3 | Fix proxy support. | 2014-09-05 22:48:10 +02:00
Arlo Louis O'Keeffe | 52337f82cb | Use correct attribute | 2014-09-03 09:36:22 +02:00
Bastian Kleineidam | b646293fd6 | Remove unused import. | 2014-07-15 22:38:57 +02:00
Bastian Kleineidam | 90257a1b5e | Replace twill with custom code. | 2014-07-15 18:37:05 +02:00
Bastian Kleineidam | a665d35feb | Use proxies and checker session in robots.txt. | 2014-07-14 20:28:28 +02:00
Bastian Kleineidam | 6c38b4165a | Use given HTTP auth data for robots.txt fetching. | 2014-07-14 19:50:11 +02:00
Bastian Kleineidam | 22caa9367a | Refactor recursion checks. | 2014-04-10 17:50:55 +02:00
Bastian Kleineidam | 08fbd891ef | Do not check external robots.txt sitemaps. | 2014-04-09 19:44:29 +02:00
Bastian Kleineidam | c57f607fc3 | Use urldata.add_url() | 2014-04-07 18:54:33 +02:00
Bastian Kleineidam | fc73c6ca6e | Log number of checked unique URLs. | 2014-03-14 23:46:17 +01:00
Bastian Kleineidam | 19b8baf08c | Move cached queue items to top once in a while. | 2014-03-14 22:08:51 +01:00
Bastian Kleineidam | b18854649d | Count unique URLs for url queue limit. | 2014-03-14 20:21:46 +01:00
Bastian Kleineidam | 257644e660 | Add cache length function to get number of cached elements. | 2014-03-14 20:19:34 +01:00
Bastian Kleineidam | 6b334dc79b | Fix URL result caching. | 2014-03-08 19:35:10 +01:00
Bastian Kleineidam | b17211f162 | Set for release. | 2014-03-04 21:36:24 +01:00
Bastian Kleineidam | 82f81241fd | Check all links and add better caching. | 2014-03-03 23:29:45 +01:00
Bastian Kleineidam | 6f205a2574 | Support checking Sitemap: URLs in robots.txt files. | 2014-03-01 20:25:19 +01:00
Bastian Kleineidam | 0f0d79c7e0 | Remove crawl-delay stuff | 2014-03-01 20:01:42 +01:00
Bastian Kleineidam | 7b34be590b | Introduce check plugins, use Python requests for http/s connections, and some code cleanups and improvements. | 2014-03-01 00:12:34 +01:00
Bastian Kleineidam | c806be5c15 | Updated copyright | 2014-01-08 22:33:04 +01:00
Bastian Kleineidam | e0a2558b2b | Updated copyright. | 2013-12-24 07:13:16 +01:00
Bastian Kleineidam | 0ca63797bf | Remove content cache. | 2013-12-10 23:41:52 +01:00
Bastian Kleineidam | 36badddfac | Update cookie code from Python module. | 2013-12-04 19:05:08 +01:00
Bastian Kleineidam | 123578a4cd | Make per-host connection limits configurable. | 2013-02-27 19:37:28 +01:00
Bastian Kleineidam | 35bc79dd90 | Updated copyright. | 2013-01-25 21:14:27 +01:00
Bastian Kleineidam | faa743e876 | Increase per-host connection limits. | 2013-01-22 18:18:48 +01:00
Bastian Kleineidam | 0283362ce6 | Updated copyright. | 2012-12-23 21:32:16 +01:00
Bastian Kleineidam | 42a17cbb98 | Prepare py3 port and display sys.argv on internal errors. | 2012-11-26 18:49:07 +01:00
Bastian Kleineidam | e5735e2a5d | Fix URL queue handling. | 2012-11-08 12:48:21 +01:00
Bastian Kleineidam | bc683577de | Remove URLs from the in_progress cache. | 2012-11-08 11:03:16 +01:00
Bastian Kleineidam | eabaa41bd2 | Do not check duplicate URLs. | 2012-11-06 21:34:22 +01:00
Bastian Kleineidam | 8750d55a73 | Add configuration entry for maximum number of URLs. | 2012-10-14 11:13:55 +02:00
Bastian Kleineidam | 3b5877161c | Improved debugging. | 2012-10-13 13:36:28 +02:00
Bastian Kleineidam | e1e80b7dd5 | Remove addrinfo cache. | 2012-10-10 10:54:58 +02:00
Bastian Kleineidam | 871508ef5d | Add docs and updated copyright. | 2012-10-10 06:53:16 +02:00
Bastian Kleineidam | 6d47b76509 | Limit HTTP and FTP connections. Gets rid of spurious BadStatusLine errors. | 2012-10-09 21:04:20 +02:00
Bastian Kleineidam | b56c054932 | Use finer-grained robots.txt locks to improve lock contention. | 2012-10-01 13:29:29 +02:00
Bastian Kleineidam | 60305d8877 | Code cleanup. | 2012-09-23 21:20:12 +02:00
Bastian Kleineidam | e21187b275 | Put in-progress URLs back near the front of URL queue, not at end. | 2012-09-23 21:00:01 +02:00
Bastian Kleineidam | fba465e8e8 | Fix robotstxt cache miss stats. | 2012-09-21 21:12:28 +02:00
Bastian Kleineidam | 4e59056ee7 | Warn about duplicate URL contents. | 2012-09-17 19:49:50 +02:00
Bastian Kleineidam | 02a09dbb28 | Add documentation. | 2012-09-17 16:30:32 +02:00
Bastian Kleineidam | 99bf8aa940 | Updated copyright. | 2012-09-17 16:09:55 +02:00
Bastian Kleineidam | 6e1841cf1f | Print download and cache statistics. | 2012-09-17 15:23:25 +02:00
Bastian Kleineidam | 21db38546c | Updated copyright. | 2012-09-02 23:36:31 +02:00
Bastian Kleineidam | 3baaca47a0 | Add maximum number of allowed puts on URL queue. | 2012-09-02 22:44:29 +02:00