This article covers how to get information within the <meta name ...> tag in HTML using htmlParse and xpathSApply. It may be a useful reference for anyone working through the same problem.

Problem description

I have a bunch of web pages and I want to extract their publishing dates. For some pages, the date is in an <abbr> tag (for example: <abbr class="published" title="2012-03-14T07:13:39+00:00">2012-03-14, 7:13</abbr>), and I was able to get the dates using:

doc <- htmlParse(theURL, asText = TRUE)
xpathSApply(doc, "//abbr", xmlValue)

But for other pages, the dates are in <meta> tags, for example:
<meta name="created" content="2011-12-29T11:49:23+00:00">
<meta name="OriginalPublicationDate" content="2012/11/14 10:56:58">

I tried xpathSApply(doc, "//meta",xmlValue), but it didn't work.

So, what pattern should I use instead of "//meta"?

Thank you!

Solution

Using this page as an example:

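library(XML)
url <- "http://stackoverflow.com/questions/22342501/"
doc <- htmlParse(url, useInternalNodes = TRUE)
# Indexing the parsed document with an XPath string returns the matching nodes;
# here, the name and content attributes of every <meta> tag.
names   <- doc["//meta/@name"]
content <- doc["//meta/@content"]
cbind(names, content)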
#      names            content                                                                                                           
# [1,] "twitter:card"   "summary"                                                                                                         
# [2,] "twitter:domain" "stackoverflow.com"                                                                                               
# [3,] "og:type"        "website"                                                                                                         
# [4,] "og:image"       "http://cdn.sstatic.net/stackoverflow/img/apple-touch-icon@2.png?v=fde65a5a78c6"                                  
# [5,] "og:title"       "how to get information within <meta name...> tag in html using htmlParse and xpathSApply"                        
# [6,] "og:description" "I have a bunch of webpages and I want to extract their publishing dates. \nFor some webpages, the da" [truncated]
# [7,] "og:url"         "http://stackoverflow.com/questions/22342501/how-to-get-information-within-meta-name-tag-in-html-usi" [truncated] 
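
The two XPath queries above run independently of each other, so on a page where some <meta> tags have a content attribute but no name (or the reverse), the two vectors can differ in length and cbind() would recycle the shorter one. Below is a minimal sketch of one way to keep each name paired with its own content. The sample HTML string and the variable names html, nodes and meta_table are made up for illustration; a real page would be parsed from its URL or file instead.

```r
library(XML)

# Hypothetical sample page (an assumption for this sketch, not a real site).
html <- '<html><head>
<meta property="og:type" content="website">
<meta name="created" content="2011-12-29T11:49:23+00:00">
<meta name="OriginalPublicationDate" content="2012/11/14 10:56:58">
</head><body></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Keep only <meta> nodes that carry both attributes, then read both
# attributes from the same node so the columns stay aligned.
nodes <- getNodeSet(doc, "//meta[@name and @content]")
meta_table <- data.frame(
  name    = sapply(nodes, xmlGetAttr, "name"),
  content = sapply(nodes, xmlGetAttr, "content"),
  stringsAsFactors = FALSE
)
meta_table
```

The first <meta> tag in the sample has no name attribute, so the predicate [@name and @content] leaves it out of the table.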

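The problem with

xpathSApply(doc, "//meta", xmlValue)

is that xmlValue(...) returns the element content (i.e., the text part of an element). <meta> tags have no text; their information is stored in attributes such as name and content, so the XPath has to select those attributes (for example //meta/@content).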
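
For the two date tags from the question, a possible approach is to filter the <meta> nodes by their name attribute and read the content attribute with xmlGetAttr(). This is a sketch under assumptions: the sample HTML string is made up from the tags quoted in the question, and get_meta_content() is a hypothetical helper written for this example, not a function of the XML package.

```r
library(XML)

# Hypothetical sample page built from the two tags quoted in the question.
html <- '<html><head>
<meta name="created" content="2011-12-29T11:49:23+00:00">
<meta name="OriginalPublicationDate" content="2012/11/14 10:56:58">
</head><body></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: return the content attribute of every <meta> tag
# whose name attribute equals meta_name.
get_meta_content <- function(doc, meta_name) {
  xpath <- sprintf("//meta[@name='%s']", meta_name)
  xpathSApply(doc, xpath, xmlGetAttr, "content")
}

created   <- get_meta_content(doc, "created")
published <- get_meta_content(doc, "OriginalPublicationDate")
```

Because different sites use different name values for the publishing date, the helper takes the name as an argument; calling it with a name that does not occur on the page returns an empty result rather than an error.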


11-01 21:53