This article covers how to use Nokogiri to grab text together with its formatting and link tags (<em>, <strong>, <a>, and so on).

Problem description

How can I recursively capture all the text with formatting tags using Nokogiri?

<div id="1">
  This is text in the TD with <strong> strong </strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>

For example, I would like to capture:

"This is text in the TD with <strong> strong </strong> tags"

"This is a child node. with <b> bold </b> tags"

"another line of text to a <a href="link.html"> link </a>"

"This is text inside a div <em>inside<em> another div inside a paragraph tag"

I can't just use .text() because it strips the formatting tags and I'm not sure how to do it recursively.

ADDED DETAIL: Sanitize looks like an interesting gem; I'm reading about it now. However, here is some added info that might clarify what I need to do.

I need to traverse each node, get the text, process it, and put it back. So I would grab the text from div 1, "This is text in the TD with strong tags", and modify it to something like "This is the modified text in the TD with strong tags". Then go to the next tag inside div 1, get its text, "This is a child node. with bold tags", modify it to "This is a modified child node. with bold tags.", and put it back. Then go to div#2 and grab the text, "another line of text to a link", modify it to "another line of modified text to a link", and put it back. Finally, go to the next node inside div#2 and grab the text from the paragraph tag, which becomes "This is modified text inside a div inside another div inside a paragraph tag".

So after everything is processed, the new HTML should look like this:

<div id="1">
  This is modified text in the TD with <strong> strong </strong> tags
  <p>This is a modified child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of modified text to a <a href="link.html"> link </a>"
      <p> This is modified text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>

Here is my quasi-code, but I'm really stuck on two parts. The first is grabbing just the text with formatting. Sanitize helps with that, but Sanitize grabs all tags. I need to preserve the formatting of just the text, including spaces and so on, without grabbing the unrelated child tags. The second is traversing down all the children that relate directly to the full-text tags.

# Quasi-code
doc = Nokogiri.HTML(html)
kids = doc.at('div#1')
text_kids = kids.descendant_elements
text_kids.each do |i|
  # Grab the full text (full sentences and paragraphs) with formatting tags.
  # Currently, I have no way to grab just the text with formatting and not the other tags.
  modified_text = processing_code(i.full_text_w_formating)
  i.full_text_w_formating = modified_text
end

def processing_code(string)
  # Code to process the string (not relevant for this example)
  return modified_string
end


# Recursive 1
class Nokogiri::XML::Node
  def descendant_elements
    # This is flawed because it grabs every child and even
    # splits it based on any tag.
    # I need to traverse down only the text-related children.
    element_children.map { |kid|
      [kid, kid.descendant_elements]
    }.flatten
  end
end
Solution

I'd use two tactics: Nokogiri to extract the content you want, then a blacklist/whitelist program to strip the tags you don't want or keep the ones you do.

require 'nokogiri'
require 'sanitize'

html = '
<div id="1">
  This is text in the TD with <strong> strong <strong> tags
  <p>This is a child node. with <b> bold </b> tags</p>
  <div id=2>
      "another line of text to a <a href="link.html"> link </a>"
      <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
  </div>
</div>
'

doc = Nokogiri.HTML(html)
html_fragment = doc.at('div#1').to_html

That captures the contents of <div id="1"> as an HTML string:

      This is text in the TD with <strong> strong <strong> tags
      <p>This is a child node. with <b> bold </b> tags</p>
      <div id="2">
          "another line of text to a <a href="link.html"> link </a>"
          <p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
      </div>
    </strong></strong>

The trailing </strong></strong> is the result of two opening <strong> tags. That might be deliberate, but with no closing tags Nokogiri will do some fixup to make the HTML correct.

Passing html_fragment to the Sanitize gem:

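# Sanitize 3.0 and later name this method Sanitize.fragment.
# Note that doc now holds the cleaned HTML string, not a Nokogiri document.
doc = Sanitize.clean(
  html_fragment,
  :elements   => %w[ a b em strong ],
  :attributes => {
    'a'    => %w[ href ],
  },
)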

The returned text looks like:

 This is text in the TD with <strong> strong <strong> tags
  This is a child node. with <b> bold </b> tags

      "another line of text to a <a href="link.html"> link </a>"
        This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em>

</strong></strong>

Again, because the HTML was malformed with no closing </strong> tags, the two trailing closing tags are present.


09-09 22:31