Problem Description
I use BeautifulSoup and urllib2 to download web pages, but different pages use different encodings, such as utf-8, gb2312, or gbk. I used urllib2 to fetch sohu's home page, which is encoded with gbk, but in my code I decoded it like this:
self.html_doc = self.html_doc.decode('gb2312', 'ignore')
But how can I know which encoding a page uses before I ask BeautifulSoup to decode it to unicode? Most Chinese websites do not include a content-type field in the HTTP headers.
Solution
Using BeautifulSoup you can parse the HTML and access the original_encoding attribute:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('http://www.sohu.com').read()
soup = BeautifulSoup(html)
>>> soup.original_encoding
u'gbk'
And this agrees with the encoding declared in the <meta> tag in the HTML's <head>:
<meta http-equiv="content-type" content="text/html; charset=GBK" />
>>> soup.meta['content']
u'text/html; charset=GBK'
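When you only have a content-type string like the one above (from the meta tag or from an HTTP header), the charset can be pulled out with a small stdlib-only sketch; the regex and the 'utf-8' fallback below are my own assumptions, not part of the original answer:

```python
import re

# Content-type string as returned by soup.meta['content'] above.
content_type = 'text/html; charset=GBK'

# Pull out the charset parameter, case-insensitively.
match = re.search(r'charset=([\w-]+)', content_type, re.IGNORECASE)

# Fall back to utf-8 when no charset is declared (an assumption).
encoding = match.group(1).lower() if match else 'utf-8'
print(encoding)  # gbk
```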
Now you can decode the HTML:
decoded_html = html.decode(soup.original_encoding)
but there is not much point, since the HTML is already available as unicode:
>>> soup.a['title']
u'\u641c\u72d0-\u4e2d\u56fd\u6700\u5927\u7684\u95e8\u6237\u7f51\u7ad9'
>>> print soup.a['title']
搜狐-中国最大的门户网站
>>> soup.a.text
u'\u641c\u72d0'
>>> print soup.a.text
搜狐
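If BeautifulSoup's automatic guess ever disagrees with what you expect, bs4 also accepts a from_encoding argument to force the encoding. A minimal sketch, using a small hypothetical GBK byte string in place of the downloaded page:

```python
from bs4 import BeautifulSoup

# Hypothetical GBK-encoded page standing in for the bytes returned
# by urllib2 -- not the actual sohu.com markup.
html = (u'<html><head></head>'
        u'<body><a title="搜狐">搜狐</a></body></html>').encode('gbk')

# Force the encoding instead of relying on auto-detection.
soup = BeautifulSoup(html, 'html.parser', from_encoding='gbk')
print(soup.original_encoding)  # gbk
print(soup.a['title'])
```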
It is also possible to attempt to detect the encoding using the chardet module (although it is a bit slow):
>>> import chardet
>>> chardet.detect(html)
{'confidence': 0.99, 'encoding': 'GB2312'}
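If chardet is not available, one crude stdlib-only fallback is to try a few likely encodings in order and keep the first that decodes without error. The helper name and candidate list below are my own invention, and this is a heuristic, not real detection:

```python
def guess_decode(raw, candidates=('utf-8', 'gbk', 'big5')):
    """Try candidate encodings in order; return (text, encoding)."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: decode with replacement characters so nothing raises.
    return raw.decode('utf-8', 'replace'), 'utf-8'

# GBK bytes (a stand-in for a downloaded page) are not valid utf-8,
# so the helper falls through to gbk.
raw = u'搜狐-中国最大的门户网站'.encode('gbk')
text, enc = guess_decode(raw)
print(enc)  # gbk
```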