Bastian Kleineidam
|
10bbb696e8
|
Limit download file size to 5MB.
|
2011-05-05 21:10:55 +02:00 |
|
Bastian Kleineidam
|
719441cca5
|
Make module detection more robust and use it when possible.
|
2011-04-20 09:08:11 +02:00 |
|
Bastian Kleineidam
|
84f6d56a49
|
Print level in loggers xml, csv and sql.
|
2011-04-09 10:51:03 +02:00 |
|
Bastian Kleineidam
|
c0732e3d37
|
Do not print empty country information.
|
2011-04-06 17:22:48 +02:00 |
|
Bastian Kleineidam
|
82e5ba8ce6
|
Add warning tag attribute in XML loggers.
|
2011-03-15 13:42:21 +01:00 |
|
Bastian Kleineidam
|
7b33cfac7b
|
Use stripped URL base when constructing absolute URL.
|
2011-03-11 15:17:36 +01:00 |
|
Bastian Kleineidam
|
420c21c2de
|
Strip leading and trailing whitespace from URLs.
|
2011-03-07 12:33:09 +01:00 |
|
Bastian Kleineidam
|
21e4824f65
|
Fix typo calling get_temp_file() function.
|
2011-03-07 09:57:40 +01:00 |
|
Bastian Kleineidam
|
0d4377d1ba
|
Support Google Chrome Bookmark files.
|
2011-02-15 18:26:00 +01:00 |
|
Bastian Kleineidam
|
25b6dc2e57
|
Refactor bookmark parsing code into own package.
|
2011-02-15 17:31:42 +01:00 |
|
Bastian Kleineidam
|
c5884b8d87
|
Add function documentation.
|
2011-02-14 21:06:34 +01:00 |
|
Bastian Kleineidam
|
4a0c63aa56
|
Fix joining of URLs when parent URL has CGI parameter.
|
2011-02-08 21:25:55 +01:00 |
|
Bastian Kleineidam
|
71b15b70f4
|
Updated copyright
|
2011-01-06 09:59:57 +01:00 |
|
Bastian Kleineidam
|
5f70b7210f
|
Add tempfile utility function.
|
2011-01-06 09:52:11 +01:00 |
|
Bastian Kleineidam
|
d011d1524c
|
Parse PHP files recursively.
|
2010-12-28 17:11:29 +01:00 |
|
Bastian Kleineidam
|
fd3fe8dcaa
|
Fix missing content types for cached URLs.
|
2010-12-23 07:37:36 +01:00 |
|
Bastian Kleineidam
|
6090e1a66c
|
Print anchor in __str__()
|
2010-12-21 20:55:49 +01:00 |
|
Bastian Kleineidam
|
7c08290c44
|
Fix broken anchor checking.
|
2010-12-20 19:55:26 +01:00 |
|
Bastian Kleineidam
|
224061e284
|
Fix to_wire by checking whether URL parts have been initialized.
|
2010-12-15 13:24:12 +01:00 |
|
Bastian Kleineidam
|
2b2121b9ed
|
Added content type and domain to URL logging info.
|
2010-12-14 20:30:53 +01:00 |
|
Bastian Kleineidam
|
01184784ef
|
Remove warning about Unicode domains which are more widely supported now.
|
2010-12-11 07:58:15 +01:00 |
|
Bastian Kleineidam
|
f14340a0a8
|
Do not check content of already cached URLs.
|
2010-10-27 19:52:48 +02:00 |
|
Bastian Kleineidam
|
d9e981e497
|
Don't log a warning if commandline URL has been redirected.
|
2010-10-26 16:24:27 +02:00 |
|
Bastian Kleineidam
|
4375d35328
|
Add warning about unsupported HTTP authentication, and revert the realm changes.
|
2010-10-25 22:41:31 +02:00 |
|
Bastian Kleineidam
|
332fa4f8f9
|
Prepare multi-realm auth configuration.
|
2010-10-25 22:07:16 +02:00 |
|
Bastian Kleineidam
|
a8aa3bdb00
|
Another fix to ensure get_content() is only called when allowed.
|
2010-10-13 22:14:43 +02:00 |
|
Bastian Kleineidam
|
61e611e4bf
|
Prevent unallowed content read when checking for robots.txt allowance in HTML files.
|
2010-10-12 00:40:34 +02:00 |
|
Bastian Kleineidam
|
1d0db02192
|
Refactor getting user and password for a URL.
|
2010-10-11 20:11:15 +02:00 |
|
Bastian Kleineidam
|
e494d6bbb6
|
Move MIME type detection into fileutil.py module, and use mimetools for detection.
|
2010-10-03 08:47:48 +02:00 |
|
Bastian Kleineidam
|
840538d12a
|
Remove unneeded check for HTML content.
|
2010-09-29 19:25:14 +02:00 |
|
Bastian Kleineidam
|
279a1eae70
|
Only add geoip info for non-empty hostnames.
|
2010-09-29 15:59:57 +02:00 |
|
Bastian Kleineidam
|
cc848cdb33
|
Fix import for moved geoip module.
|
2010-09-29 15:17:27 +02:00 |
|
Bastian Kleineidam
|
8a1ac26c85
|
Warn about obfuscated IP numbers.
|
2010-09-05 20:11:02 +02:00 |
|
Bastian Kleineidam
|
8a074aeea9
|
Work around Python 2.6+ urljoin bug.
|
2010-08-31 09:16:24 +02:00 |
|
Bastian Kleineidam
|
c3b8ff00b3
|
Check content and recursion in one try/except to avoid multiple errors when getting page content.
|
2010-08-31 06:52:08 +02:00 |
|
Bastian Kleineidam
|
1faedafb33
|
Fix data size for HTTP requests.
|
2010-08-04 00:06:25 +02:00 |
|
Bastian Kleineidam
|
0f92b76290
|
Remove the unnormed URL warning.
|
2010-07-29 20:20:59 +02:00 |
|
Bastian Kleineidam
|
7ad4f7c220
|
Compare size from meta info and content data.
|
2010-07-29 19:53:41 +02:00 |
|
Bastian Kleineidam
|
d9bfd25a68
|
Add warning if content size is zero
|
2010-07-28 08:19:55 +02:00 |
|
Bluebird75
|
28f4514b67
|
Use object with __slots__ for wire-format of UrlBase objects.
Saves memory since UrlBase wire-format objects are used for
logging and thus often created.
Signed-off-by: Bastian Kleineidam <calvin@debian.org>
|
2010-03-27 00:07:19 +01:00 |
|
Bastian Kleineidam
|
3370ea1562
|
Reflect changes in httplib2.py: use buffered read in httplib response object and use bad status line exception attribute.
|
2010-03-26 20:50:38 +01:00 |
|
Bastian Kleineidam
|
37b4e97012
|
Revert "Only parse anchors if both --anchors option is given and the current link has an anchor."
This reverts commit b238527d54.
|
2010-03-10 00:04:02 +01:00 |
|
Bastian Kleineidam
|
b238527d54
|
Only parse anchors if both --anchors option is given and the current link has an anchor.
|
2010-03-09 11:45:50 +01:00 |
|
Bastian Kleineidam
|
57397e938b
|
Improved linkname parsing by adding a new peek() HTML parser function.
|
2010-03-09 11:31:12 +01:00 |
|
Bastian Kleineidam
|
51a0ef0ad4
|
Speed up HTML parsing by stopping early and adding callbacks.
|
2010-03-08 09:04:33 +01:00 |
|
Bastian Kleineidam
|
1e15e55689
|
Fix errors in Word file parsing.
|
2010-03-07 19:43:08 +01:00 |
|
Bastian Kleineidam
|
6a2fcf8ae9
|
Parse links in Word files.
|
2010-03-07 19:20:51 +01:00 |
|
Bastian Kleineidam
|
77daf80e82
|
Add url encoding parameter
|
2009-11-28 11:56:35 +01:00 |
|
Bastian Kleineidam
|
5e06b6b8d4
|
Updated FSF address in GPL blurb
|
2009-07-24 23:58:20 +02:00 |
|
Bastian Kleineidam
|
7f67027abf
|
Ignore the fragment part (i.e. the anchor) of URIs when getting and caching content.
|
2009-06-26 07:22:36 +02:00 |
|