How to get the innerHTML of a node using scrapy Selector?

Question

Suppose there are some HTML fragments like:

<a>
   text in a
   <b>text in b</b>
   <c>text in c</c>
</a>
<a>
   <b>text in b</b>
   text in a
   <c>text in c</c>
</a>

I want to extract the text inside each a tag, dropping the nested tags while keeping their text. For instance, the content I want to extract above would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes using the scrapy Selector css() function, but how do I process these nodes to get what I want? Any ideas would be appreciated, thank you!

Recommended answer

Here is what I managed to do:

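from scrapy.selector import Selector

# html_string holds the HTML from the question
sel = Selector(text=html_string)

# 'a *::text' selects every text node under each a element
for node in sel.css('a *::text'):
    print(node.extract())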

Assuming that html_string is a variable holding the HTML from the question, this code produces the following output:

   text in a

text in b


text in c




text in b

   text in a

text in c

The selector a *::text matches all the text nodes that are descendants of a nodes. This includes the whitespace-only text nodes between tags, which is why the output contains blank lines.

