Screen scraping with Nokogiri or Hpricot

Problem description

I'm trying to get the actual value at a given XPath. I have the following code in a sample.rb file:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open('http://www.changebadtogood.com/'))
desc "Trying to get the value of a given XPath"
task :sample do
  begin
    doc.xpath('//*[@id="view_more"]').each do |link|
      puts link.content
    end
  rescue Exception => e
    puts "error"
  end
end

The output is:

View more issues ..

When I try to get the value for a different XPath, such as:
/html/body/div[4]/div[3]/h1/span
then I get the "error" message.

I tried this in Nokogiri. I don't know why it gives a result for only a few XPaths.

I tried the same in Hpricot:
http://hpricot.com/demonstrations

I pasted my URL and XPaths there, and for
//*[@id="view_more"]
I see the result
View more issues ..
[This text is present at the bottom of the recent issues header]

But it shows no result for:
/html/body/div[4]/div[3]/h1/span
For this XPath I'm expecting the result Bad.
[This is present at http://www.changebadtogood.com/ as the first header of the class="hero-unit" div.]

Recommended answer

Your problem has to do with a poor XPath selector, and is unrelated to Nokogiri or Hpricot. Let's investigate:

irb:01:0> require 'nokogiri'; require 'open-uri'
#=> true
irb:02:0> doc = Nokogiri::HTML(open('http://www.changebadtogood.com/')); nil
#=> nil
irb:03:0> doc.xpath('//*[@id="view_more"]').each{ |link| puts link.content }
View more issues ..
#=> 0
irb:04:0> doc.at('#view_more').text  # Simpler version of the above.
#=> "View more issues .."
irb:05:0> doc.xpath('/html/body/div[4]/div[3]/h1/span')
#=> []
irb:06:0> doc.xpath('/html/body/div[4]')
#=> []
irb:07:0> doc.xpath('/html/body/div').length
#=> 2

From this we can see that only two divs are children of the <body> element, so div[4] selects nothing.

It appears that you're trying to select the span here:

<h1 class="landing_page_title">
  Change <span style='color: #808080;'>Bad</span> To Good
</h1>

Instead of relying on the fragile markup leading up to this (indexing into anonymous hierarchies of elements), use the semantic structure of the document to your advantage for a selector that is both simpler and more robust. Using either CSS or XPath syntax:

irb:08:0> doc.at('h1.landing_page_title > span').text
#=> "Bad"
irb:09:0> doc.at_xpath('//h1[@class="landing_page_title"]/span').text
#=> "Bad"
