mirror of
https://github.com/Hopiu/linkchecker.git
synced 2026-05-01 11:34:41 +00:00
Add guidance on character set detecting including cchardet
parent 2bc0b716bc
commit 3b19680e97
3 changed files with 21 additions and 3 deletions
@@ -7,8 +7,25 @@ When installing from source, for application translations to be installed
 polib_ needs to be installed before LinkChecker. After LinkChecker installation
 polib_ can be removed.
 
+There are several steps to resolve problems with detecting the character
+encoding of checked HTML pages:
+first ensure the web server, if used, is not returning an incorrect charset in
+the Content-Type header; second, if possible add a meta element to the HTML
+page with the correct charset; finally, install cchardet_ - Beautiful Soup has
+its own encoding detector but will use in order of preference cchardet_ or
+chardet_. You might find that one of the other three detectors works better for
+your pages. There may already be a system copy of e.g. chardet installed;
+installing LinkChecker in a Python venv_ gives control over which packages are
+used.
+
+.. _chardet: https://pypi.org/project/chardet/
+
+.. _cchardet: https://pypi.org/project/cchardet/
+
 .. _polib: https://pypi.org/project/polib/
+
+.. _venv: https://docs.python.org/3/library/venv.html#creating-virtual-environments
 
 Setup with pip
 ------------------
 pip_ can be used to install LinkChecker on the local system.
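Note on the first two steps in the added paragraph (the Content-Type header and the meta element): a minimal standard-library sketch of how a declared charset could be inspected before reaching for a detector. The names `charset_from_content_type`, `MetaCharsetFinder` and `meta_charset` are hypothetical helpers for illustration; they are not part of LinkChecker or Beautiful Soup.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(header_value):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Remember the charset declared by the first meta element that has one."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 form: <meta charset="utf-8">
            self.charset = attrs["charset"].lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older form: <meta http-equiv="Content-Type" content="...; charset=...">
            self.charset = charset_from_content_type(attrs.get("content") or "")


def meta_charset(html_text):
    """Return the charset declared in a meta element of html_text, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

If the header value and the meta element disagree, or neither declares a charset, that is the situation in which the choice of detector is likely to matter.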
@@ -195,8 +195,9 @@ class TestParser(unittest.TestCase):
         self.encoding_test(html, "ascii")
 
     def encoding_test(self, html, expected):
-        # If chardet is installed Beautiful Soup uses it for encoding detection.
+        # For encoding detection Beautiful Soup uses if available in order
+        # of preference cchardet then chardet.
         # Results for html without a valid charset may differ
-        # based on chardet availability.
+        # based on cchardet/chardet availability.
         soup = htmlsoup.make_soup(html)
         self.assertEqual(soup.original_encoding, expected)
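Note on the updated comment: since the expected encoding may depend on which detector is importable, it can help to check the environment when a result differs between machines. The following is a rough sketch, assuming only the preference order stated in the comment (cchardet, then chardet). It checks importability with the standard library and does not call Beautiful Soup; `PREFERRED_DETECTORS`, `installed_detectors`, `preferred_detector` and `in_virtualenv` are hypothetical names.

```python
import importlib.util
import sys

# Import names in the order of preference described in the test comment.
PREFERRED_DETECTORS = ("cchardet", "chardet")


def installed_detectors(names=PREFERRED_DETECTORS):
    """Return the entries of names that can be imported, keeping their order."""
    return [name for name in names if importlib.util.find_spec(name) is not None]


def preferred_detector(names=PREFERRED_DETECTORS):
    """Return the first importable name, or None when none is installed."""
    found = installed_detectors(names)
    return found[0] if found else None


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix
```

Calling `preferred_detector()` returns `None` when neither package is installed, in which case Beautiful Soup's own detector would apply according to the install notes. `in_virtualenv()` indicates whether a system-wide copy of chardet could be the one being picked up.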
tox.ini (2 changes)
@@ -3,7 +3,7 @@ envlist = py36, py37, py38, py39, py310
 
 [base]
 deps =
-    chardet
+    cchardet
     pyftpdlib
     parameterized
     pdfminer