Problem Description
I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as utf-8, gb2312, or gbk. I used urllib2 to fetch sohu's home page, which is encoded with gbk, but in my code I decoded it like this:
self.html_doc = self.html_doc.decode('gb2312', 'ignore')
But how can I know which encoding a page uses before I ask BeautifulSoup to decode it to unicode? Most Chinese websites do not include a content-type field in the HTTP headers.
Solution
Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)
>>> soup.original_encoding
u'gbk'
And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:
<meta http-equiv="content-type" content="text/html; charset=GBK" />
>>> soup.meta['content']
u'text/html; charset=GBK'
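When you only have a content-type string like the one above (from the meta tag or from an HTTP header), the charset can be pulled out with a small stdlib-only sketch; the regex and the 'utf-8' fallback below are my own assumptions, not part of the original answer:

```python
import re

# Content-type string as returned by soup.meta['content'] above.
content_type = 'text/html; charset=GBK'

# Pull out the charset parameter, case-insensitively.
match = re.search(r'charset=([\w-]+)', content_type, re.IGNORECASE)

# Fall back to utf-8 when no charset is declared (an assumption).
encoding = match.group(1).lower() if match else 'utf-8'
print(encoding)  # gbk
```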
Now you can decode the HTML:
decoded_html = html.decode(soup.original_encoding)
but there is not much point, since the HTML is already available as unicode:
>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐
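If BeautifulSoup's automatic guess ever disagrees with what you expect, bs4 also accepts a from_encoding argument to force the encoding. A minimal sketch, using a small hypothetical GBK byte string in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# Hypothetical GBK-encoded page standing in for the bytes returned
# by urllib2 -- not the actual sohu.com markup.
html = (u'<html><head></head>'
        u'<body><a title="搜狐">搜狐</a></body></html>').encode('gbk')

# Force the encoding instead of relying on auto-detection.
soup = BeautifulSoup(html, 'html.parser', from_encoding='gbk')
print(soup.original_encoding)  # gbk
print(soup.a['title'])
```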
It is also possible to attempt to detect the encoding using the chardet module (although it is a bit slow):
>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
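If chardet is not available, one crude stdlib-only fallback is to try a few likely encodings in order and keep the first that decodes without error. The helper name and candidate list below are my own invention, and this is a heuristic, not real detection:

```python
def guess_decode(raw, candidates=('utf-8', 'gbk', 'big5')):
    """Try candidate encodings in order; return (text, encoding)."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters so nothing raises.
    return raw.decode('utf-8', 'replace'), 'utf-8'

# GBK bytes (a stand-in for a downloaded page) are not valid utf-8,
# so the helper falls through to gbk.
raw = u'搜狐-中国最大的门户网站'.encode('gbk')
text, enc = guess_decode(raw)
print(enc)  # gbk
```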