This article explains how to handle the html5lib error TypeError: __init__() got an unexpected keyword argument 'encoding'.

Problem description

I'm trying to install html5lib. At first I tried to install the latest version (eight or nine nines), but it conflicted with my BeautifulSoup, so I decided to try an older version (0.9999999, seven nines). I installed it, but when I try to use it:

>>> with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())

I get an error:

Traceback (most recent call last):
  File "<pyshell#11>", line 2, in <module>
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 35, in parse
    return p.parse(doc, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

What is wrong and what should I do?

Solution

Something broke in the latest versions of html5lib with regard to bs4: html5lib.treebuilders._base is no longer there. With bs4 4.4.1, the latest compatible version seems to be the one with seven nines. Once you install it as below, it works fine:

 pip3 install -U html5lib=="0.9999999"

Tested using bs4 4.4.1:

In [1]: import bs4

In [2]: bs4.__version__
Out[2]: '4.4.1'

In [3]: import html5lib

In [4]: html5lib.__version__
Out[4]: '0.9999999'

In [5]: from urllib.request import urlopen

In [6]: with urlopen("http://example.com/") as f:
   ...:     document = html5lib.parse(f, encoding=f.info().get_content_charset())
   ...:
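If the same call fails on your machine, it may be worth confirming which copy of html5lib Python actually imports, since several installs can coexist. The sketch below is only an illustration: module_info is a hypothetical helper name, not part of html5lib, and it relies on the module exposing __version__ and __file__, as html5lib does in the sessions shown here.

```python
import importlib


def module_info(name):
    """Return (version, file path) of the module that `import name` loads.

    Hypothetical helper for illustration; either value is None when the
    module does not define the corresponding attribute.
    """
    module = importlib.import_module(name)
    version = getattr(module, "__version__", None)
    path = getattr(module, "__file__", None)
    return version, path


# Example (requires html5lib to be installed):
# print(module_info("html5lib"))
```

If the reported version is not 0.9999999, or the path points somewhere other than the site-packages directory you installed into, the interpreter is picking up a different install than the one you expect.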

The module was renamed in the commit "Rename treebuilders._base to .base to reflect public status".

The error you see occurs because you are still running the newest version. In html5lib/_inputstream.py, HTMLBinaryInputStream has no encoding argument:

class HTMLBinaryInputStream(HTMLUnicodeInputStream):
    """Provides a unicode stream of characters to the HTMLTokenizer.

    This class takes care of character encoding and removing or replacing
    incorrect byte-sequences and also provides column and line tracking.

    """

    def __init__(self, source, override_encoding=None, transport_encoding=None,
                 same_origin_parent_encoding=None, likely_encoding=None,
                 default_encoding="windows-1252", useChardet=True):

Setting override_encoding=f.info().get_content_charset() should do the trick.
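If the same script has to run against both old and new html5lib releases, one option is to look at the constructor signature and pass whichever keyword it accepts. This is a minimal sketch, not an official html5lib recipe: charset_kwargs and parse_url are hypothetical names, and importing the private _inputstream module (or inputstream in 0.9999999 and older) is an assumption that may not hold in future releases.

```python
import inspect


def charset_kwargs(stream_cls, charset):
    """Pick the keyword that stream_cls.__init__ accepts for a known charset.

    Hypothetical helper: it only inspects the signature, so any class works.
    """
    params = inspect.signature(stream_cls.__init__).parameters
    for name in ("override_encoding", "encoding"):
        if name in params:
            return {name: charset}
    return {}  # neither keyword accepted: leave encoding detection to the parser


def parse_url(url):
    """Fetch url and parse it with whichever html5lib release is installed."""
    from urllib.request import urlopen
    import html5lib
    try:
        # Newer releases keep the class in a private, underscore-prefixed module.
        from html5lib._inputstream import HTMLBinaryInputStream
    except ImportError:
        # Assumption: 0.9999999 and older expose it as html5lib.inputstream.
        from html5lib.inputstream import HTMLBinaryInputStream
    with urlopen(url) as f:
        charset = f.info().get_content_charset()
        return html5lib.parse(f, **charset_kwargs(HTMLBinaryInputStream, charset))
```

Calling parse_url("http://example.com/") then passes encoding= on the seven-nines release and override_encoding= on newer ones. If you control the environment, pinning a single html5lib version and using the matching keyword directly is simpler.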

Upgrading bs4 to the latest version also works fine with the latest version of html5lib:

In [16]: bs4.__version__
Out[16]: '4.5.1'

In [17]: html5lib.__version__
Out[17]: '0.999999999'

In [18]: with urlopen("http://example.com/") as f:
   ....:     document = html5lib.parse(f, override_encoding=f.info().get_content_charset())
   ....:

07-01 09:31