



I wrote the following Python code:

from bs4 import BeautifulSoup
import urllib2

url= 'http://www.example.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(),"html.parser")
freq=soup.find('div', attrs={'id':'frequenz'})
print freq

The result is:

<div id="frequenz" style="font-size:500%; font-weight: bold; width: 100%; height: 10%; margin-top: 5px; text-align: center">tempsensor</div>

When I look at this site with a web browser, the page shows dynamic content, not the string 'tempsensor'. The temperature value is refreshed automatically every second, so something in the web page is replacing the string 'tempsensor' with a numerical value.

My question is: how can I get Python to show the updated numerical value? How can I obtain, in BeautifulSoup, the value that automatically replaces tempsensor?


Sorry, no. This is not possible with BeautifulSoup alone.

The problem is that BS4 is not a complete web browser. It is only an HTML parser. It parses neither CSS nor JavaScript, and it does not execute scripts.

A complete web browser does at least four things:

  1. Connects to web servers and fetches data.
  2. Parses HTML content and CSS formatting and renders the web page.
  3. Parses JavaScript content and runs it.
  4. Provides user interaction: browser navigation, HTML forms, and an events API for the JavaScript program.

Still not sure? Look at your code. BS4 does not even cover the first step, fetching the web page. To do that, you had to use urllib2.

Dynamic sites usually include JavaScript that runs in the browser and periodically updates the content. BS4 doesn't run it, so you won't see those updates, and you never will by using only BS4. Why? Because item (3) above, downloading and executing the JavaScript program, is not happening. It would happen in IE, Firefox, or Chrome, which is why those browsers show the dynamic content while BS4-only scraping does not.
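
To make this concrete, here is a minimal sketch using Python 3's standard-library html.parser, which is the backend the question's code selects with "html.parser". The page markup, its inline script, and the value 21.5 are invented for illustration, since the real page's JavaScript is not shown in the question. The parser reads the script as text but never runs it, so the div still holds the placeholder.

```python
from html.parser import HTMLParser

# Hypothetical page: in a real browser the inline script would replace
# the placeholder, but a parser never executes it.
PAGE = """
<html><body>
<div id="frequenz">tempsensor</div>
<script>
  setInterval(function () {
    document.getElementById('frequenz').innerHTML = '21.5';
  }, 1000);
</script>
</body></html>
"""


class DivTextParser(HTMLParser):
    """Collect the text of a (non-nested) <div> with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # Script contents also arrive here, but only as inert text.
        if self.inside:
            self.text += data


def static_div_text(html, target_id):
    """Return the div's text exactly as it appears in the static HTML."""
    parser = DivTextParser(target_id)
    parser.feed(html)
    parser.close()
    return parser.text.strip()


print(static_div_text(PAGE, "frequenz"))  # prints: tempsensor
```

The same would be expected from BS4 on this markup: it returns what the server sent, not what a browser would display after running the script.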

PhantomJS and CasperJS provide a scriptable, automated browser that can often run the JavaScript that drives dynamic websites. However, CasperJS and PhantomJS are programmed in JavaScript, not Python.

Apparently, some people use the browser built into PyQt4 for these kinds of dynamic screen-scraping tasks, isolating part of the DOM and sending that to BS4 for parsing. That might allow for a Python solution.

In the comments, @Cyphase suggests that the exact data you want might be available at a different URL, in which case it could be fetched and parsed with urllib2/BS4. You can determine this by carefully examining the JavaScript that runs on the site. In particular, look for setTimeout and setInterval, which schedule updates, or for ajax calls or jQuery's .load function, which fetch data from the back end. JavaScript that updates dynamic content will usually fetch data only from back-end URLs on the same web site. If the site uses jQuery, $('#frequenz') refers to the div, and searching the JS for that selector may lead you to the code that updates it. Without jQuery, the update would probably use document.getElementById('frequenz').
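
One way to carry out that search is to save the page's JavaScript and scan it for the usual update hooks. The sketch below is only an illustration in Python 3: the sample script, the /frequenz.txt path, and the helper name find_update_hints are assumptions, not taken from the real site. It flags lines that mention setTimeout, setInterval, ajax, .load, or the frequenz div.

```python
import re

# Patterns suggested above, written as regular expressions.
HINT_PATTERNS = [
    r"setTimeout",
    r"setInterval",
    r"ajax",
    r"\.load\s*\(",
    r"\$\(\s*['\"]#frequenz['\"]\s*\)",
    r"getElementById\(\s*['\"]frequenz['\"]\s*\)",
]
HINT_RE = re.compile("|".join(HINT_PATTERNS))


def find_update_hints(js_source):
    """Return (line_number, line) pairs for lines matching any hint."""
    hits = []
    for number, line in enumerate(js_source.splitlines(), start=1):
        if HINT_RE.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical script, invented for illustration only.
SAMPLE_JS = """\
function refresh() {
  $('#frequenz').load('/frequenz.txt');
}
var unrelated = 42;
setInterval(refresh, 1000);
"""

for number, line in find_update_hints(SAMPLE_JS):
    print(number, line)
```

If a line like the .load call turns up, the URL it names can be requested directly (with urllib2 in Python 2, or urllib.request in Python 3) and polled once a second. Whether such a URL exists, and what format it returns, depends entirely on the site.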

