HTML Agility Pack in C#

Problem description

We are moving an e-commerce website to a new platform. Because all of its pages are static HTML and the product information is not all stored in a database, we must scrape the current website for the product descriptions.

Here is one of the pages: http://www.cabinplace.com/accrugsbathblackbear.htm

What is the best way to get the description into a string? Should I use the HTML Agility Pack, and if so, how would this be done? I am new to the HTML Agility Pack and to XHTML in general.

Thanks

Recommended answer

The HTML Agility Pack is a good library to use for this kind of work.

You did not indicate whether all of the content is structured this way, nor whether you have already extracted the kind of fragment you posted from the HTML files, so it is difficult to advise further.

In general, if all pages are structured similarly, I would use an XPath expression to extract the paragraph and pick the InnerHtml or InnerText from each page.

Something like the following:

// htmlDoc is an HtmlAgilityPack.HtmlDocument that has already been loaded
var description = htmlDoc.DocumentNode.SelectNodes("//p[@class='content_txt']")[0].InnerText;
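
A slightly fuller sketch of the same idea may help. It assumes the HtmlAgilityPack NuGet package is referenced and that the description sits in a p element with the class content_txt, as in the line above. The class DescriptionScraper and its method names are hypothetical, chosen only for illustration. The XPath query runs against DocumentNode, and SelectSingleNode returns null when nothing matches, so the result is checked before use.

```csharp
using System;
using HtmlAgilityPack;

public static class DescriptionScraper
{
    // XPath for the description paragraph. The class name is taken from the
    // answer above and may differ on other pages (assumption).
    private const string DescriptionXPath = "//p[@class='content_txt']";

    // Returns the description text from a parsed document,
    // or null when no matching paragraph exists.
    public static string FromDocument(HtmlDocument htmlDoc)
    {
        var node = htmlDoc.DocumentNode.SelectSingleNode(DescriptionXPath);
        if (node == null)
        {
            return null;
        }

        // InnerText drops the tags; DeEntitize converts entities such as
        // &amp; back into plain characters.
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    // Parses an HTML string already held in memory.
    public static string FromHtml(string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        return FromDocument(htmlDoc);
    }

    // Downloads a page over HTTP and extracts its description.
    public static string FromUrl(string url)
    {
        HtmlDocument htmlDoc = new HtmlWeb().Load(url);
        return FromDocument(htmlDoc);
    }
}
```

For instance, DescriptionScraper.FromUrl("http://www.cabinplace.com/accrugsbathblackbear.htm") would return the description text if that page uses the content_txt class. To keep the markup inside the paragraph, read node.InnerHtml instead of node.InnerText.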
