Problem description
I'm trying to install html5lib. At first I tried to install the latest version (8 or 9 nines), but it conflicted with my BeautifulSoup, so I decided to try an older version (0.9999999, seven nines). I installed it, but when I try to use it:
>>> with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
I get an error:
Traceback (most recent call last):
  File "<pyshell#11>", line 2, in <module>
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 35, in parse
    return p.parse(doc, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'
What's wrong? What should I do?
Recommended answer
Something broke between the latest html5lib releases and bs4: html5lib.treebuilders._base no longer exists. With bs4 4.4.1, the latest compatible html5lib release seems to be the one with seven nines. Once you install it as shown below, it works fine:
pip3 install -U html5lib=="0.9999999"
Tested using bs4 4.4.1:
In [1]: import bs4
In [2]: bs4.__version__
Out[2]: '4.4.1'
In [3]: import html5lib
In [4]: html5lib.__version__
Out[4]: '0.9999999'
In [5]: from urllib.request import urlopen
In [6]: with urlopen("http://example.com/") as f:
   ...:     document = html5lib.parse(f, encoding=f.info().get_content_charset())
   ...:
In [7]:
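The session above reads the versions interactively. As a rough sketch, the same check can be wrapped in a small helper that also reports where each package was loaded from, which helps confirm which html5lib the interpreter really picks up. The name describe_module is hypothetical and chosen only for this illustration.

```python
import importlib


def describe_module(name):
    """Return (version, file path) for an importable module, or None if it
    cannot be imported. Either element may be None if the module lacks it."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None), getattr(module, "__file__", None)


# "html5lib" and "bs4" are the packages discussed here; a missing package
# is reported instead of raising.
for package in ("html5lib", "bs4"):
    info = describe_module(package)
    if info is None:
        print(package, "is not importable from this interpreter")
    else:
        print(package, "version:", info[0], "loaded from:", info[1])
```

If the reported html5lib version is not the one you just installed, pip probably installed it for a different interpreter than the one you are running.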
The rename is visible in the html5lib commit "Rename treebuilders._base to .base to reflect public status".
The error you see occurs because you are still running the newest version. There, in html5lib/_inputstream.py, HTMLBinaryInputStream takes no encoding argument:
class HTMLBinaryInputStream(HTMLUnicodeInputStream):
    """Provides a unicode stream of characters to the HTMLTokenizer.

    This class takes care of character encoding and removing or replacing
    incorrect byte-sequences and also provides column and line tracking.

    """

    def __init__(self, source, override_encoding=None, transport_encoding=None,
                 same_origin_parent_encoding=None, likely_encoding=None,
                 default_encoding="windows-1252", useChardet=True):
Passing override_encoding=f.info().get_content_charset() instead of encoding=... should do the trick.
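If the same code has to run against both releases, one option is to look at the stream constructor's signature and pick whichever keyword it accepts. The following is only a sketch: encoding_keyword is a hypothetical helper, and the two classes are stand-ins that mimic the constructor signatures discussed here. They are not the real html5lib classes, and the older signature is simplified to the single encoding parameter.

```python
import inspect


def encoding_keyword(stream_class):
    """Return the keyword that stream_class.__init__ accepts for a
    caller-supplied encoding, or None if neither known name is present."""
    params = inspect.signature(stream_class.__init__).parameters
    for name in ("override_encoding", "encoding"):
        if name in params:
            return name
    return None


# Stand-in mimicking the newer signature quoted in the answer.
class NewStyleStream:
    def __init__(self, source, override_encoding=None, transport_encoding=None,
                 same_origin_parent_encoding=None, likely_encoding=None,
                 default_encoding="windows-1252", useChardet=True):
        self.source = source


# Stand-in for an older release that accepts `encoding` (simplified assumption).
class OldStyleStream:
    def __init__(self, source, encoding=None):
        self.source = source


# Stand-in with no encoding-related parameter at all.
class PlainStream:
    def __init__(self, source):
        self.source = source


new_kw = encoding_keyword(NewStyleStream)
old_kw = encoding_keyword(OldStyleStream)
plain_kw = encoding_keyword(PlainStream)
print(new_kw, old_kw, plain_kw)
```

With the real library you would pass the installed HTMLBinaryInputStream class and then call html5lib.parse(f, **{keyword: charset}). Note that the class lives in a module whose name differs between releases, so check the installed package before relying on an import path.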
Alternatively, upgrading bs4 to its latest version works fine with the latest version of html5lib:
In [16]: bs4.__version__
Out[16]: '4.5.1'
In [17]: html5lib.__version__
Out[17]: '0.999999999'
In [18]: with urlopen("http://example.com/") as f:
   ....:     document = html5lib.parse(f, override_encoding=f.info().get_content_charset())
   ....:
In [19]: