Browser automation with Selenium: fingerprints, recognizability and traceability?

Problem description

I want to use Selenium/WebDriver to simulate a browser and scrape some website content with it. Even if it's not the fastest method, it has many advantages for me, such as executing scripts.

Many websites forbid automated access, for example search engines like Google or Bing.

For one tool I need to scrape the estimated result count (the result stats) from Google for several keywords. It would work like this: simulate a browser that visits google.com, types in a keyword and scrapes the results, then after a short pause types in the next keyword, scrapes the results, and so on...

My question is: can a website recognize that I'm using Selenium to drive the browser instead of operating it by hand? The Google case in particular gives me some doubts. I know Selenium is partly developed by Google, or at least by some people working for Google. So does Selenium leave fingerprints, or is it impossible to tell whether I'm using the browser myself or driving it with Selenium, even for Google?

Recommended answer

No, nobody can actually see that you're using Selenium with WebDriver rather than operating the browser by hand. I'm not sure about the old Selenium RC, but it should work the same way. Here's how it works:

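  1. Selenium launches a browser with a clean profile (or a profile of your choice).
  2. Selenium connects to the browser so that it can manipulate and control it. The browser still does most of the work, though. Essentially, Selenium replaces the user's input to the browser, but nothing more.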

You can easily verify this by reading the contents of the HTTP headers sent by your browser.
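One way to read those headers is to run a tiny local server that echoes back whatever request headers it receives, then load it once by hand and once through Selenium and compare the two outputs. The following is only a minimal sketch using the Python standard library; the names `HeaderEchoHandler`, `start_echo_server` and `fetch_headers` are made up for this illustration and are not part of Selenium.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeaderEchoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with the request headers as JSON."""

    def do_GET(self):
        body = json.dumps(dict(self.headers.items()), indent=2).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request console logging


def start_echo_server(port=0):
    """Start the echo server on 127.0.0.1 in a background thread.

    port=0 lets the OS pick a free port; read the chosen one
    from server.server_address[1].
    """
    server = HTTPServer(("127.0.0.1", port), HeaderEchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_headers(url, extra_headers=None):
    """Request the echo page without a browser and return the headers the server saw."""
    request = urllib.request.Request(url, headers=extra_headers or {})
    # An empty ProxyHandler makes sure the local request bypasses any system proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

To use it, call `start_echo_server(8000)` from a script that stays alive, open http://127.0.0.1:8000/ in a hand-operated browser and in a Selenium-driven one, and diff the two JSON documents. This only compares HTTP headers; it says nothing about what a page could observe through JavaScript.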

If you ever actually need your server to recognize Selenium, you can use Browsermob-proxy to add a custom header to your requests.

All that said, there is one thing you must be aware of. While there's no way to detect Selenium directly, the website you're visiting can pick up some indirect clues. These usually include checking for too many requests made in virtually no time, which might be an issue for you. Make sure your Selenium script behaves like a user.
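As a rough illustration of pacing, the sketch below inserts a random pause between consecutive keywords instead of firing them back to back. The helper names and the 4 to 12 second bounds are arbitrary assumptions for this example, and `action` is a hypothetical placeholder for whatever Selenium steps handle one keyword. Nothing here guarantees that a site will treat the traffic as human.

```python
import random
import time


def human_like_delays(count, minimum=4.0, maximum=12.0, rng=None):
    """Return `count` random pause lengths in seconds, each within [minimum, maximum]."""
    if minimum < 0 or maximum < minimum:
        raise ValueError("need 0 <= minimum <= maximum")
    rng = rng or random.Random()
    return [rng.uniform(minimum, maximum) for _ in range(count)]


def run_paced(keywords, action, minimum=4.0, maximum=12.0,
              sleep=time.sleep, rng=None):
    """Call action(keyword) for each keyword, pausing a random time between calls.

    `action` stands in for the per-keyword browser steps (assumption);
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    delays = human_like_delays(max(len(keywords) - 1, 0), minimum, maximum, rng)
    results = []
    for index, keyword in enumerate(keywords):
        if index > 0:
            sleep(delays[index - 1])  # pause only between keywords, not before the first
        results.append(action(keyword))
    return results
```

Passing `sleep` and `rng` as parameters is a design choice for testability: a seeded `random.Random` makes the pauses reproducible, and a stub `sleep` lets you check the schedule instantly.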

EDIT 2016/04:

Apparently it is possible: https://stackoverflow.com/a/33403473/2930045 states that a company can do it. My guess, and it is nothing but a guess, is that they can run some JS that Selenium installs into the browser in order to operate it.


09-27 02:13