Problem description
Suppose there are some HTML fragments like:
<a>
text in a
<b>text in b</b>
<c>text in c</c>
</a>
<a>
<b>text in b</b>
text in a
<c>text in c</c>
</a>
I want to extract the text inside each tag, dropping the nested tags themselves but keeping their text. For instance, the content I want to extract from the fragments above would be "text in a text in b text in c" and "text in b text in a text in c". I can already get the nodes using the scrapy Selector css() function, but how should I process these nodes to get what I want? Any ideas would be appreciated, thank you!
Recommended answer
Here is what I managed to do:
from scrapy.selector import Selector

sel = Selector(text=html_string)
# 'a *::text' selects every text node nested anywhere inside an <a> element.
for node in sel.css('a *::text'):
    print(node.extract())
Assuming that html_string is a variable holding the HTML from your question, this code produces the following output:
text in a
text in b
text in c
text in b
text in a
text in c
The selector a *::text matches all the text nodes that are descendants of a nodes.