This article explains how to decode and encode web pages with Python. It should be a useful reference for anyone facing the same problem — read on to learn more!

Problem description



I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as utf-8, gb2312, or gbk. I used urllib2 to fetch sohu's home page, which is encoded with gbk, but in my code I decoded it this way:

self.html_doc = self.html_doc.decode('gb2312', 'ignore')

But how can I know which encoding a page uses before I ask BeautifulSoup to decode it to unicode? On most Chinese websites there is no content-type field in the HTTP headers.

Solution

Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)

>>> soup.original_encoding
u'gbk'

And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:

<meta http-equiv="content-type" content="text/html; charset=GBK" />

>>> soup.meta['content']
u'text/html; charset=GBK'
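If you only need this meta-tag lookup and do not want to parse the whole document first, a rough equivalent can be done with the standard library alone. The `sniff_charset` helper below is a hypothetical name and a minimal sketch of the idea, not what BeautifulSoup actually does internally:

```python
import re

def sniff_charset(raw_bytes, default='utf-8'):
    # Search the first 1024 bytes for a charset=... declaration,
    # as found in <meta http-equiv="content-type" ...> tags.
    match = re.search(br'charset\s*=\s*["\']?\s*([\w-]+)',
                      raw_bytes[:1024], re.IGNORECASE)
    if match:
        return match.group(1).decode('ascii').lower()
    return default

page = u'<meta http-equiv="content-type" content="text/html; charset=GBK" />'.encode('gbk')
print(sniff_charset(page))  # gbk
```

This only covers pages that declare their charset in the markup; BeautifulSoup's own detection is more thorough.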

Now you can decode the HTML:

decoded_html = html.decode(soup.original_encoding)

but there's not much point, since the HTML is already available as unicode:

>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐
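If neither the headers nor the markup declare an encoding, a common pragmatic fallback is to try the likely candidates in order, instead of silently dropping bytes with 'ignore' as in the question. `decode_html` below is a hypothetical helper illustrating that idea; it is not part of the original answer:

```python
def decode_html(raw_bytes, encodings=('utf-8', 'gbk', 'gb2312')):
    # Try each candidate encoding; the first one that decodes
    # without error wins. As a last resort, substitute
    # replacement characters rather than raise.
    for enc in encodings:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode('utf-8', 'replace')

print(decode_html(u'\u641c\u72d0'.encode('gbk')))  # 搜狐
```

Note that gbk is a superset of gb2312, so listing gbk first handles both in practice.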

It is also possible to attempt to detect it using the chardet module (although it is a bit slow):

>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
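When the server does send a charset in the Content-Type header, the standard library can extract it directly instead of guessing. The snippet below builds the header by hand purely for illustration; in Python 3, urllib.request responses expose the same lookup as resp.headers.get_content_charset():

```python
from email.message import Message

# HTTP response headers are parsed as email.message.Message objects;
# get_content_charset() returns the charset parameter, lower-cased.
headers = Message()
headers['Content-Type'] = 'text/html; charset=GBK'
print(headers.get_content_charset())  # gbk
```

It returns None when no charset parameter is present, which is exactly the case the question describes for many Chinese sites.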

That concludes this article on how to decode and encode web pages with Python. We hope the answer above is helpful, and we appreciate your continued support!


09-06 17:27