Problem description

I am attempting to scrape a website that uses some kind of Flash plugin, which loads data after I retrieve the HTML. The page contains the following object:

<OBJECT classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" WIDTH="250" HEIGHT="20" id="Preloader">
      <PARAM NAME="movie" VALUE="/images/preloader.swf">
      <PARAM NAME="quality" VALUE="high">
      <PARAM NAME="bgcolor" VALUE="#FFFFFF">
      <EMBED src="/images/preloader.swf" quality="high" bgcolor="#FFFFFF" WIDTH="250" HEIGHT="20" NAME="Preloader" ALIGN="" TYPE="application/x-shockwave-flash" PLUGINSPAGE="http://www.macromedia.com/go/getflashplayer"></EMBED>
</OBJECT>

I've tried to locate the data being received in Wireshark, but with no luck. I know nothing about this Flash plugin or how it works. I'm guessing that, in the worst case, I will not be able to do this.

            // Requires: using System.IO; using System.Net; using System.Text;
            HttpWebRequest mainRequest = (HttpWebRequest)WebRequest.Create(URL);
            mainRequest.Method = "GET";
            mainRequest.Proxy = null; // skip automatic proxy detection
            string data;
            // The using blocks close the response and reader even if reading throws.
            using (WebResponse mainResponse = mainRequest.GetResponse())
            using (StreamReader dataReader = new StreamReader(mainResponse.GetResponseStream(), Encoding.UTF8))
            {
                data = dataReader.ReadToEnd();
            }
            return data;

Does anyone know a way I can receive this data, or make the web response wait until the data has been injected into the HTML before it is returned? Any help would be greatly appreciated.

UPDATE: It seems I may have jumped the gun a little with the Flash object. I think it is just a loading animation shown while the table populates. I've been using Fiddler to see what is going on. The first request returns a page containing a loading div with the Flash object inside it. A few seconds later, when the data is ready, another page is returned with the data. From what I can remember (I'm not at home, so I cannot confirm right now), the request for the new page has the same headers as the original. There's no JSON or AJAX data in Fiddler, and as far as I can see there's no script on the client that would cause a refresh. I do not understand what is causing this update.

I've briefly looked at the WebBrowser object, but I imagine it would be quite a performance hit when I'm scraping about 200 pages, which currently takes a minute or so. I will try the AMF viewer later to confirm that the Flash object is not the source of the update.

I'm guessing that the server causes the page to be resent once it has the table ready. If the server finds the loading div and replaces it with the table of data, would that cause the whole page to be resent? Or would that show up as AJAX/JSON data? If it is the server resending the data, how can I keep the response open until it is ready to send the new page?

Thanks. JM.

Recommended answer

If the content is being loaded dynamically into the Flash movie, it is very likely arriving over a standard HTTP request. Wireshark may be a little overkill for detecting something like this. I'd recommend using a utility that captures HTTP traffic, such as Charles, HttpFox, or screen-scraper. Using one of those tools, watch the HTTP requests that occur while the content is loading. Once you determine which request it is, you can likely just replicate it in your code.
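As a rough illustration of replicating a captured request in C#, the sketch below assumes the capture tool showed a plain GET that depends on session cookies and a Referer header. It also assumes, as the question's update suggests, that requesting the same URL again eventually returns the populated page. The helper names (`BuildReplayRequest`, `IsStillLoading`, `FetchWhenReady`), the retry parameters, and the use of `preloader.swf` as the "still loading" marker are hypothetical choices for illustration, not something confirmed for this site.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Threading;

public static class ReplaySketch
{
    // Build a GET request that mimics what the browser sent.
    // Sharing one CookieContainer keeps session cookies across requests.
    public static HttpWebRequest BuildReplayRequest(string url, string referer, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.Proxy = null;
        request.CookieContainer = cookies;
        if (!string.IsNullOrEmpty(referer))
        {
            request.Referer = referer;
        }
        return request;
    }

    // Assumption: the placeholder page can be recognised by the preloader movie.
    public static bool IsStillLoading(string html)
    {
        return html.IndexOf("preloader.swf", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Re-request the page until the preloader is gone or the attempts run out.
    // Returns the last HTML received, which may still be the placeholder page.
    public static string FetchWhenReady(string url, int maxAttempts, int delayMs)
    {
        CookieContainer cookies = new CookieContainer();
        string html = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            HttpWebRequest request = BuildReplayRequest(url, url, cookies);
            using (WebResponse response = request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream(), Encoding.UTF8))
            {
                html = reader.ReadToEnd();
            }
            if (!IsStillLoading(html))
            {
                break;
            }
            Thread.Sleep(delayMs);
        }
        return html;
    }
}
```

A call such as `ReplaySketch.FetchWhenReady(URL, 10, 1000)` would try up to ten times, one second apart. If the capture shows the data coming from a different URL, or via POST, the request builder would need to be changed to match what was captured.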

That said, I've also seen cases (though they are not very common) where the data is loaded into the Flash movie over a binary protocol, which makes things a little more difficult. AMF is often the protocol used in these cases. Charles proxy will detect this protocol, so it may be the tool to use here. A while back I wrote a blog post on extracting data that's delivered via AMF. It deals with a Java library, but you may be able to find something equivalent in .NET.
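One way to tell from code whether a captured response is AMF rather than HTML or JSON is to inspect its Content-Type, since AMF sent over HTTP is normally labeled `application/x-amf`. The sketch below only does that check and does not decode the payload. The class and method names are hypothetical.

```csharp
using System;

public static class AmfSketch
{
    // Media type normally used for AMF messages sent over HTTP.
    public const string AmfContentType = "application/x-amf";

    // Returns true when a Content-Type header value names the AMF media type.
    public static bool LooksLikeAmf(string contentType)
    {
        if (string.IsNullOrEmpty(contentType))
        {
            return false;
        }
        // Drop any parameters such as "; charset=utf-8" before comparing.
        string mediaType = contentType.Split(';')[0].Trim();
        return string.Equals(mediaType, AmfContentType, StringComparison.OrdinalIgnoreCase);
    }
}
```

With the question's code, this could be called as `AmfSketch.LooksLikeAmf(mainResponse.ContentType)` before deciding whether to read the body as text.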
