Commit graph

6246 commits

Author SHA1 Message Date
Chris Mayo
4ffdbf2406 Replace MetaRobotsFinder using BeautifulSoup.find() 2020-04-29 20:07:00 +01:00
anarcat
350f8bfef9
Merge pull request #373 from linkchecker/fix-swf-parsing
SWF files are binary data
2020-04-27 09:39:52 -04:00
Marius Gedminas
680783b1ff SWF files are binary data
Should fix #372.
2020-04-27 11:25:37 +03:00
anarcat
183d483074
Merge pull request #365 from cjmayo/tidyten1
Remove use of the future package
2020-04-26 12:02:30 -04:00
anarcat
125146fb2c
Merge pull request #361 from cjmayo/parser4
Rename htmlsax.py to htmlsoup.py and add test_content_allows_robots
2020-04-25 17:56:29 -04:00
anarcat
87079312db
Merge pull request #371 from cjmayo/manhtml
Switch to mandoc for generating html man pages
2020-04-24 18:59:10 -04:00
Chris Mayo
b7c8ad9be7 Fix typo for -Dplugin in man page 2020-04-24 19:46:30 +01:00
Chris Mayo
5dd448cf05 Add link to unknownurl.py in man page 2020-04-24 19:46:30 +01:00
Chris Mayo
a506800c07 Replace `` in man page with bold formatting 2020-04-24 19:46:30 +01:00
Chris Mayo
e3b77f810e Update external links in man pages to https 2020-04-24 19:46:30 +01:00
Chris Mayo
a205a3722b Update man pages to optimise for both html and man
- Use "LinkChecker User Manual" as the source for both pages.
- .UR/.UE for external links to allow mandoc to create links in html.
- Use Linux man-pages format for cross references e.g.
  .BR linkcheckerrc (5) which are replaced in the html by the Makefile.
2020-04-24 19:46:30 +01:00
Chris Mayo
441cda5e15 Switch to mandoc for generating html man pages
Removes the need for diff files and is a currently maintained project.

Cross references are only supported for mdoc macros but because we only
have two pages this can be achieved with sed.

A clean target is added to the Makefile to make development easier.
2020-04-24 19:46:30 +01:00
Chris Mayo
56b8c9f7ab Add tests for <meta name="robots" content="nofollow">
norobots.html was used for testing <meta name="robots"
content="nofollow"> in local files until [1]. This commit reinstates
local file testing and adds an http test.

Checking is reported by checker.httpurl.HttpUrl.content_allows_robots().

[1] ce733ae7 ("Don't check for robots.txt directives in local html
files.", 2014-03-19)
2020-04-18 20:30:46 +01:00
Chris Mayo
d189445a8e LinkFinder does not raise StopParse 2020-04-18 20:30:46 +01:00
Chris Mayo
ee6628a831 Move HtmlParser/htmlsax.py to htmlutil/htmlsoup.py
Remove one subpackage and some import lines where htmlutil.linkparse is
also being used.
2020-04-18 20:30:45 +01:00
anarcat
0f18c9b8f0
Merge pull request #360 from cjmayo/parser3
Replace Parser class using BeautifulSoup.find_all()
2020-04-18 14:37:03 -04:00
Chris Mayo
384e1e196d Remove Python 2 gettext builtin installation 2020-04-15 19:49:16 +01:00
Chris Mayo
a83fbb56c0 Remove from __future__ imports 2020-04-15 19:49:16 +01:00
Chris Mayo
f5e7f3a382 Remove use of the future package
It was providing Python 2 compatibility.
2020-04-15 19:49:16 +01:00
Chris Mayo
0795e3c1b4 Replace Parser class using BeautifulSoup.find_all() 2020-04-10 13:51:09 +01:00
Chris Mayo
eb3cf28baa Remove support for start_end_element() callback
The LinkFinder handler start_end_element() callback does nothing apart
from call start_element().
2020-04-10 13:51:09 +01:00
Chris Mayo
c9f17e92b9 Remove support for end_element() callback 2020-04-10 13:51:09 +01:00
Chris Mayo
48b590cf8b Replace FormFinder using BeautifulSoup.find_all()
FormFinder was the only handler that used an end_element() callback and
was therefore a blocker to moving the Parser class to use
BeautifulSoup.find_all().

FormFinder was a specialised handler used to parse a login form at
the start of a session if the user had configured authentication
credentials.
2020-04-10 13:51:05 +01:00
anarcat
d80a075372
Merge pull request #357 from cjmayo/parser2
Simplify the Parser class
2020-04-09 15:22:14 -04:00
Chris Mayo
974915cc4f Remove encoding from Parser
Only used by the test, and already available as an attribute of the soup object.
2020-04-08 20:03:35 +01:00
Chris Mayo
02e1c389b2 Remove parser flush() and reset()
Remnants of the feed() interface.
2020-04-08 20:03:35 +01:00
Chris Mayo
3771dd9136 Use parser.feed_soup() instead of parser.feed()
Markup is not being passed in pieces to the parser, so simplify the
interface and reduce the state further.
2020-04-08 20:03:35 +01:00
Chris Mayo
40f43ae41c Create one function to make soup objects 2020-04-08 20:03:35 +01:00
Chris Mayo
9d8d251d06 Replace Parser lineno() and column() methods
Stop storing this data in Parser object state.
2020-04-08 20:03:35 +01:00
anarcat
e6374fa73a
Merge pull request #358 from cjmayo/testform
Add a test for search_form
2020-04-07 17:37:15 -04:00
Chris Mayo
16e6fb2919 Fix incorrect character in FormFinder log message 2020-04-07 19:24:34 +01:00
Chris Mayo
00f940d979 Fix FormFinder callbacks for missing element_text
element_text added in:
51a06d8a ("Remove home-cooked htmlparser and use BeautifulSoup",
2019-07-22)
2020-04-07 19:24:34 +01:00
Chris Mayo
514210199d Add tests for search_form 2020-04-07 19:24:34 +01:00
anarcat
7d55855ffb
Merge pull request #356 from cjmayo/parser1
Remove unnecessary parser-related code
2020-04-04 09:26:51 -04:00
Chris Mayo
fe024fb0c8 Remove unused Parser.debug() method 2020-04-03 19:24:08 +01:00
Chris Mayo
0c5e3bb403 Remove old HtmlParser .gitignore
htmlparse.output was a product of the built-in parser.
2020-04-03 19:24:08 +01:00
Chris Mayo
036b900ffc Remove unused linkcheck.containers classes 2020-04-03 19:24:08 +01:00
Chris Mayo
3ff3d72492 Use BeautifulSoup element attrs directly 2020-04-03 19:24:08 +01:00
Chris Mayo
a7e1e20172 Remove last line and column from Parser
Only used for debug log message and not very useful.
2020-04-03 19:24:08 +01:00
anarcat
25d517521c
Merge pull request #353 from cjmayo/setup
Tidy setup.py for C extensions and Python 2
2020-04-02 10:10:38 -04:00
anarcat
39aa438d06
Merge pull request #354 from cjmayo/unicode
Remove use of Python 2 unicode() and related u prefixes
2020-04-02 10:10:31 -04:00
Chris Mayo
28701e291a Remove use of Python 2 unicode() and related u prefixes
Several instances for MS Windows left unchanged.
2020-04-01 19:39:50 +01:00
Chris Mayo
e0bf5fc24f Remove unused imports and variables from setup.py 2020-04-01 19:21:47 +01:00
Chris Mayo
f6b273d05e Remove code for compiling C extensions from setup.py
C extensions for parser and network utilities have been replaced in
Python.
2020-04-01 19:21:47 +01:00
Chris Mayo
9f899605a9 Remove Python 2 compatibility from setup.py
sys.version_info was introduced in Python 2.0.
2020-04-01 19:21:47 +01:00
anarcat
cf4e6bb235
Merge pull request #351 from cjmayo/tagsonly
Remove support for non-Tag elements from Parser
2020-04-01 12:17:18 -04:00
Marius Gedminas
7c14bf1ad6 Declare supported Python versions in setup.py
The python_requires is the important one; it means once we publish a
new release on PyPI, pip install will know not to try to install it if
you run it on Python 2 and will fall back to an older version.
2020-04-01 17:49:51 +03:00
anarcat
b5c8a5d1ce
Merge pull request #314 from cjmayo/postbs4
Replace memoized with functools.lru_cache and deprecations
2020-04-01 10:28:18 -04:00
Chris Mayo
9fc651e82b Remove Python 2 compatibility from parser tests 2020-03-31 20:10:35 +01:00
Chris Mayo
ffa6ac457f Remove support for non-Tag elements from Parser
This change is made because the linkchecker handlers only process
Tags.

The test HtmlPrettyPrinter handler is updated to output element text
because its support for non-Tag elements has been removed. This results
in a number of the existing tests still passing.
2020-03-31 20:10:35 +01:00