Problem description
I wrote the following Python code:
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
freq = soup.find('div', attrs={'id': 'frequenz'})
print freq
The result is:
<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>
When I look at this site in a web browser, the page shows dynamic content, not the string 'tempsensor'. The temperature value is refreshed automatically every second, so something in the web page is replacing the string 'tempsensor' with a numerical value.
My question is: how can I get Python to show the updated numerical value? How can I obtain, with BeautifulSoup, the value that automatically replaces tempsensor?
Sorry, no. This is not possible with BeautifulSoup alone.
The problem is that BS4 is not a complete web browser. It is only an HTML parser. It does not parse CSS, and it does not parse or run JavaScript.
A complete web browser does at least four things:
- Connects to web servers and fetches data
- Parses HTML content and CSS formatting and presents a web page
- Parses JavaScript content and runs it
- Provides for user interaction: browser navigation, HTML forms, and an events API for the JavaScript program
Still not sure? Look at your code. BS4 does not even cover the first step, fetching the web page; to do that you had to use urllib2.
Dynamic sites usually include JavaScript that runs in the browser and periodically updates the contents. BS4 does not run it, so you will not see the updates, and you never will by using only BS4. Why? Because the third item above, downloading and executing the JavaScript program, is not happening. It would be happening in IE, Firefox, or Chrome, and that is why those browsers show dynamic content while BS4-only scraping does not.
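As a rough illustration of why a parser alone only ever sees the placeholder, here is a minimal sketch. It is written for Python 3 and uses the standard library's html.parser as a stand-in for BS4 so that it has no dependencies; the page markup, the script, and the DivTextParser class are all invented for this example and are not taken from the real site.

```python
from html.parser import HTMLParser

# Hypothetical page: both the markup and the script are invented for illustration.
PAGE = """
<html><body>
<div id="frequenz">tempsensor</div>
<script>
  setInterval(function () {
    document.getElementById('frequenz').innerHTML = '21.5';
  }, 1000);
</script>
</body></html>
"""


class DivTextParser(HTMLParser):
    """Collect the text inside the element whose id matches target_id.

    Uses a simple nesting counter, so it does not handle void elements
    such as <br> inside the target; that is enough for this sketch.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # greater than 0 while inside the target element
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # The script body also arrives here as plain data, but it is
        # never executed, so the div keeps its static placeholder text.
        if self.depth:
            self.text.append(data)


parser = DivTextParser("frequenz")
parser.feed(PAGE)
static_text = "".join(parser.text).strip()
print(static_text)  # prints: tempsensor
```

The script in the page is just text to the parser, so the value it would write into the div in a browser never appears. The BS4 call from the question, soup.find('div', attrs={'id': 'frequenz'}), is in the same position.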
PhantomJS and CasperJS provide a more mechanized browser that can often run the JavaScript code that drives dynamic websites. But CasperJS and PhantomJS are programmed in server-side JavaScript, not Python.
Apparently, some people use a browser built into PyQt4 for these kinds of dynamic screen-scraping tasks, isolating part of the DOM and sending that to BS4 for parsing. That might allow for a Python solution.
In comments, @Cyphase suggests that the exact data you want might be available at a different URL, in which case it could be fetched and parsed with urllib2/BS4. This can be determined by careful examination of the JavaScript running on the site. In particular, look for setTimeout and setInterval, which schedule updates, or for ajax or jQuery's .load function, which fetch data from the back end. JavaScript that updates dynamic content will usually only fetch data from back-end URLs of the same web site. If the site uses jQuery, $('#frequenz') refers to the div, and by searching the JS for this you may find the code that updates the div. Without jQuery, the JS update would probably use document.getElementById('frequenz').
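The search described above can be sketched roughly as follows. This is Python 3 (where urllib.request plays the role of urllib2), and everything specific in it is an assumption made for illustration: the sample script, the '/sensor/value' path, and the helper names find_hooks, find_candidate_paths and fetch_text are hypothetical, not something known about the real site.

```python
import re
import urllib.request

# Hypothetical script text, invented for illustration; on a real page it
# would be copied from the <script> tags or the linked .js files.
SAMPLE_JS = """
function refresh() {
  $.ajax({url: '/sensor/value', success: function (data) {
    $('#frequenz').html(data);
  }});
}
setInterval(refresh, 1000);
"""

# Patterns named in the answer: timers, AJAX calls, and references to the div.
HOOK_PATTERNS = {
    "timer": r"\bset(?:Timeout|Interval)\s*\(",
    "ajax": r"\.ajax\s*\(|\.load\s*\(",
    "div reference": r"\$\(\s*['\"]#frequenz['\"]\s*\)"
                     r"|getElementById\(\s*['\"]frequenz['\"]\s*\)",
}

# Quoted strings that look like site-relative paths, e.g. '/sensor/value'.
PATH_PATTERN = r"['\"](/[^'\"\s]*)['\"]"


def find_hooks(js_source):
    """Return the names of the hook patterns that occur in js_source."""
    return [name for name, pattern in HOOK_PATTERNS.items()
            if re.search(pattern, js_source)]


def find_candidate_paths(js_source):
    """Return quoted site-relative paths that might be back-end URLs."""
    return re.findall(PATH_PATTERN, js_source)


def fetch_text(url, timeout=5):
    """Fetch a URL and return its body as text (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


hooks = find_hooks(SAMPLE_JS)
paths = find_candidate_paths(SAMPLE_JS)
print(hooks)  # prints: ['timer', 'ajax', 'div reference']
print(paths)  # prints: ['/sensor/value']
```

If a candidate path turns up on the real site, joining it to the site's base URL and passing the result to fetch_text would be the next thing to try. Whether the response is a bare number, JSON, or an HTML fragment depends on the site, so the parsing step is left open here.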