This article walks through scraping all links from a single-page website (one with a "Load More" button) using BeautifulSoup, and should be a useful reference for anyone facing the same problem.

Problem description

I want to scrape all links from a website that does not have pagination i.e., there's a 'LOAD MORE' button, but the URL does not change depending on how much data you've asked for.

When I parse the page with BeautifulSoup and ask for all the links, it only shows the links on the vanilla first page of the website. I can manually reveal older content by clicking the 'LOAD MORE' button, but is there a way to do so programmatically?

Here is what I mean:

import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen('http://www.thedailybeast.com/politics.html')
soup = BeautifulSoup(page)

for link in soup.find_all('a'):
    print link.get('href')

And unfortunately there's no URL that is responsible for pagination.

Recommended answer

When you click the "Load More" button, an XHR request is issued to the http://www.thedailybeast.com/politics.view.<page_number>.json endpoint. You need to simulate that request in your code and parse the JSON response. Working example using requests:

import requests

with requests.Session() as session:
    for page in range(1, 10):
        print("Page number #%s" % page)

        # The "Load More" button fires a request to this JSON endpoint behind the scenes
        response = session.get("http://www.thedailybeast.com/politics.view.%s.json" % page)
        data = response.json()

        # Each page of results carries its articles under the "stream" key
        for article in data["stream"]:
            print(article["title"])

It prints:

Page number #1
The Two Americas Behind Donald Trump and Bernie Sanders
...
Hillary Clinton’s Star-Studded NYC Bash: Katy Perry, Jamie Foxx, and More Toast the Candidate
Why Do These Republicans Hate Maya Angelou’s Post Office?
Page number #2
No, Joe Biden Is Not a Supreme Court Hypocrite
PC Hysteria Claims Another Professor
WHY BLACK CELEB ENDORSEMENTS MATTER MOST
...
Inside Trump’s Make Believe Presidential Addresses
...
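Building on the answer above, the loop can be wrapped so it stops on its own instead of guessing a fixed page count. This is only a sketch: the `max_pages` safety cap and the assumption that an exhausted feed returns an empty `stream` (or a non-200 status) are mine, not something the site documents; the `"stream"` and `"title"` keys are the ones used in the answer.

```python
import requests


def extract_titles(data):
    """Pull article titles out of one page of the politics.view JSON feed."""
    return [article["title"] for article in data.get("stream", [])]


def scrape_all_titles(base_url="http://www.thedailybeast.com/politics.view.%s.json",
                      max_pages=100):
    """Fetch pages until the endpoint stops returning articles.

    max_pages is just a safety cap so a surprising response
    shape cannot turn this into an infinite loop.
    """
    titles = []
    with requests.Session() as session:
        for page in range(1, max_pages + 1):
            response = session.get(base_url % page)
            if response.status_code != 200:
                break  # endpoint ran out of pages (assumed behavior)
            page_titles = extract_titles(response.json())
            if not page_titles:
                break  # empty "stream" treated as end of feed (assumed)
            titles.extend(page_titles)
    return titles
```

The same stop-on-empty approach works for any field in the article objects, not just titles, if the question's goal is links rather than headlines.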

