How to get a node's innerHTML using scrapy Selector?

Problem description

Suppose there are some HTML fragments like:

<a>
   text in a
   <b>text in b</b>
   <c>text in c</c>
</a>
<a>
   <b>text in b</b>
   text in a
   <c>text in c</c>
</a>

I want to extract the text inside each a tag, dropping the nested tags themselves but keeping their text. For the fragments above, the content I want would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes using the scrapy Selector css() function, but how should I process these nodes to get what I want? Any idea would be appreciated, thank you!

Recommended answer

Here is what I managed to do:

from scrapy.selector import Selector

sel = Selector(text=html_string)

# 'a *::text' selects every text node nested anywhere inside an <a> element
for node in sel.css('a *::text'):
    print(node.extract())

Assuming that html_string is a variable holding the HTML from the question, this code produces the following output:

   text in a

text in b


text in c




text in b

   text in a

text in c

The selector a *::text matches all the text nodes that are descendants of a nodes.
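The extracted text nodes still carry the indentation and newlines from the markup, so one more step is needed to reach the single-line strings the question asks for. The sketch below shows one possible way to do that with plain string handling. The helper name join_text_fragments is hypothetical, and the two lists are hand-written stand-ins (an assumption, not captured output) for the text nodes found inside each a element.

```python
def join_text_fragments(fragments):
    """Join text fragments into one string, collapsing whitespace runs."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops empty pieces, so indentation and newlines between tags vanish.
    return " ".join(" ".join(fragments).split())


# Hand-written stand-ins (an assumption) for the text nodes of each <a> element.
first_a = ["\n   text in a\n   ", "text in b", "\n   ", "text in c", "\n"]
second_a = ["\n   ", "text in b", "\n   text in a\n   ", "text in c", "\n"]

for fragments in (first_a, second_a):
    print(join_text_fragments(fragments))
```

With Scrapy, the fragments for each element could presumably be collected per node, for example [join_text_fragments(a.css('*::text').extract()) for a in sel.css('a')], so that each a element yields its own string instead of one flat stream of text nodes.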
