Add guidance on character set detecting including cchardet

Chris Mayo 2021-12-06 19:24:26 +00:00
parent 2bc0b716bc
commit 3b19680e97
3 changed files with 21 additions and 3 deletions

@@ -7,8 +7,25 @@ When installing from source, for application translations to be installed
 polib_ needs to be installed before LinkChecker. After LinkChecker installation
 polib_ can be removed.
+There are several steps to resolve problems with detecting the character
+encoding of checked HTML pages:
+first, ensure the web server, if used, is not returning an incorrect charset in
+the Content-Type header; second, if possible, add a meta element to the HTML
+page with the correct charset; finally, install cchardet_. Beautiful Soup has
+its own encoding detector but will use, in order of preference, cchardet_ or
+chardet_ if available. You might find that one of the other two detectors works
+better for your pages. There may already be a system copy of e.g. chardet
+installed; installing LinkChecker in a Python venv_ gives control over which
+packages are used.
+.. _chardet: https://pypi.org/project/chardet/
+.. _cchardet: https://pypi.org/project/cchardet/
 .. _polib: https://pypi.org/project/polib/
+.. _venv: https://docs.python.org/3/library/venv.html#creating-virtual-environments
 Setup with pip
 ------------------
 pip_ can be used to install LinkChecker on the local system.
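
The first and last steps in the guidance added above can be checked from Python. The following is a minimal sketch using only the standard library, not part of this commit: the helper names `charset_from_content_type` and `installed_detectors` are hypothetical, and the sketch only reports which optional detector modules are importable; it does not call Beautiful Soup itself.

```python
import importlib.util
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None.

    Hypothetical helper: it parses the header text only, it does not
    contact a web server.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def installed_detectors(names=("cchardet", "chardet")):
    """List the optional detector modules importable in this environment.

    The default order follows the preference described in the install
    guidance (cchardet first, then chardet).
    """
    return [name for name in names if importlib.util.find_spec(name) is not None]


if __name__ == "__main__":
    # A header that declares a charset, and one that does not.
    print(charset_from_content_type("text/html; charset=ISO-8859-1"))
    print(charset_from_content_type("text/html"))
    # Which optional detectors are visible to this interpreter.
    print(installed_detectors())
```

Running this inside and outside a venv is one way to see whether a system copy of chardet is being picked up.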

@@ -195,8 +195,9 @@ class TestParser(unittest.TestCase):
         self.encoding_test(html, "ascii")

     def encoding_test(self, html, expected):
-        # If chardet is installed Beautiful Soup uses it for encoding detection.
+        # For encoding detection Beautiful Soup uses, if available and in
+        # order of preference, cchardet then chardet.
         # Results for html without a valid charset may differ
-        # based on chardet availability.
+        # based on cchardet/chardet availability.
         soup = htmlsoup.make_soup(html)
         self.assertEqual(soup.original_encoding, expected)

@@ -3,7 +3,7 @@ envlist = py36, py37, py38, py39, py310

 [base]
 deps =
-    chardet
+    cchardet
     pyftpdlib
     parameterized
     pdfminer