I am sending a GET request to the CareerBuilder API:

import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVELOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text

This returns an XML document. However, I cannot parse it.
Using lxml:
>>> from lxml import etree
>>> print etree.fromstring(xml)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print etree.fromstring(xml)
  File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.

ElementTree:
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print ET.fromstring(xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)

So even though the XML file starts with
<?xml version="1.0" encoding="UTF-8"?>

I think it contains characters that are not allowed. How can I parse this file with either lxml or ElementTree?

Best answer

You are using the decoded Unicode value. Use the raw response data, r.raw, instead:

r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True
etree.parse(r.raw)

This reads the data directly from the response; note the stream=True option passed to .get().
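Setting the r.raw.decode_content = True flag ensures that the raw socket gives you the decompressed content even if the response is gzip- or deflate-compressed.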
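To see why decompression has to happen before parsing, here is a minimal offline sketch in Python 3 using only the standard library. It does not touch requests or the CareerBuilder API: the XML payload is made up, and gzip.GzipFile merely stands in for what decode_content does on the raw stream.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Hypothetical payload, standing in for a gzip-compressed response body.
xml_bytes = b'<?xml version="1.0" encoding="UTF-8"?><jobs><job>Biologist</job></jobs>'
compressed = gzip.compress(xml_bytes)

# Feeding the still-compressed bytes to the parser fails.
try:
    ET.parse(io.BytesIO(compressed))
    parsed_compressed = True
except ET.ParseError:
    parsed_compressed = False

# Decompressing while reading gives the parser the bytes it expects.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
    root = ET.parse(stream).getroot()

job_titles = [job.text for job in root.iter('job')]
```

The parser only needs a file-like object that yields the decompressed bytes, which is the role r.raw plays once decode_content is set.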
You don't have to stream the response; for smaller XML documents it is fine to use the r.content attribute, which is the undecoded response body:
r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)

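XML parsers always expect bytes as input, because the XML format itself dictates how the parser is to decode those bytes into Unicode text. That is why passing r.text, which requests has already decoded to Unicode, triggers both errors shown in the question.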
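As a small illustration of that last point, the sketch below (Python 3, standard library only) parses two made-up byte strings that encode the same non-breaking space (U+00A0, the character from the ElementTree traceback) in different encodings. The documents are invented for the example and are not the actual API response.

```python
import xml.etree.ElementTree as ET

# The same text, serialized as bytes in two different encodings.
# U+00A0 is two bytes in UTF-8 and a single byte in ISO-8859-1.
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><title>Biologist\xc2\xa0II</title>'
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><title>Biologist\xa0II</title>'

# The parser reads the encoding declaration and decodes the bytes itself.
utf8_title = ET.fromstring(utf8_doc).text
latin1_title = ET.fromstring(latin1_doc).text
```

Both calls yield the same Unicode string, because in each case the parser picks the decoding from the declaration rather than relying on the caller.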
