This article describes how to fetch multiple URLs with BeautifulSoup, collect meta-data for WordPress plugins, and sort the results by timestamp.

Problem description

I am trying to scrape a small chunk of information from a site, but it keeps printing "None", as if the title (or any tag I replace it with) doesn't exist.

The project: a list of meta-data for WordPress plugins. Approximately 50 plugins are of interest, but the challenge is that I want to fetch the meta-data of all existing plugins. What I subsequently want to filter out after the fetch are the plugins with the newest timestamps, i.e. those that were updated most recently. It is all about actuality...

https://wordpress.org/plugins/wp-job-manager
https://wordpress.org/plugins/ninja-forms
https://wordpress.org/plugins/participants-database ....and so on and so forth.
 

We have the following set of meta-data for each WordPress plugin:

Version: 1.9.5.12
Active installations: 10,000+
WordPress Version: 5.0 or higher
Tested up to: 5.4
PHP Version: 5.6 or higher
Tags: database, members, sign-up form, volunteer
Last updated: 19 hours ago

The project consists of two parts: the looping part (which seems to be pretty straightforward) and the parser part, where I have some issues - see below. I'm trying to loop through an array of URLs and scrape the data below from a list of WordPress plugins. See my loop below:

from bs4 import BeautifulSoup
import requests

# array of URLs to loop through, will be larger once I get the loop working correctly
plugins = ['https://wordpress.org/plugins/wp-job-manager', 'https://wordpress.org/plugins/ninja-forms']

This can be done like this:

ttt = page_soup.find("div", {"class":"plugin-meta"})
# take every second <li> from the meta box, skipping the last entry
text_nodes = [node.text.strip() for node in ttt.ul.findChildren('li')[:-1:2]]

The output of text_nodes:
['Version: 1.9.5.12', 'Active installations: 10,000+', 'Tested up to: 5.6 '] 
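
For context, a minimal sketch of how page_soup above could be produced from a single plugin page (the variable names here are my own):

import requests
from bs4 import BeautifulSoup

# fetch one plugin page and parse it; this yields the page_soup the snippet above expects
page = requests.get("https://wordpress.org/plugins/wp-job-manager")
page_soup = BeautifulSoup(page.content, "html.parser")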

But if we want to fetch the data of all the WordPress plugins and subsequently sort them to show - let us say - the latest 50 updated plugins, this would be an interesting task:

  • First of all we need to fetch the URLs.

  • Then we fetch the information and have to sort out the newest timestamps, i.e. the plugins that were updated most recently (see the sketch below).

  • List the 50 newest items - that is, the 50 plugins that were updated most recently...
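
Sorting needs a comparable timestamp, but the plugin pages only expose relative strings such as "Last updated: 19 hours ago". Here is a minimal sketch of turning those strings into approximate datetimes, assuming the site phrases them as "N hours/days/weeks/months/years ago" (parse_last_updated is my own hypothetical helper):

import re
from datetime import datetime, timedelta

def parse_last_updated(text):
    # e.g. "Last updated: 19 hours ago" -> an approximate absolute datetime
    m = re.search(r"(\d+)\s+(hour|day|week|month|year)s?\s+ago", text)
    if not m:
        return datetime.min  # unparseable entries sort last
    amount, unit = int(m.group(1)), m.group(2)
    if unit == "month":                      # rough approximations are fine here,
        amount, unit = amount * 30, "day"    # we only need a sort order
    elif unit == "year":
        amount, unit = amount * 365, "day"
    return datetime.utcnow() - timedelta(**{unit + "s": amount})

print(parse_last_updated("Last updated: 19 hours ago"))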

Challenge: how do we avoid overloading the RAM while fetching all the URLs? (See How extract all URLs in a website using BeautifulSoup for interesting insights, approaches and ideas.)
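
One way to keep memory usage flat, sketched under the assumption that the plugin records are produced lazily by a generator rather than collected into one big list: a bounded selection via heapq only ever holds the current top candidates in RAM.

import heapq

def newest(records, n=50):
    # records: an iterable of (last_updated_datetime, plugin_name) tuples;
    # heapq.nlargest consumes it lazily and keeps at most n items in memory
    return heapq.nlargest(n, records, key=lambda rec: rec[0])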

At the moment I am trying to figure out how to fetch all the URLs and parse them:

a. how to fetch the meta-data of each plugin
b. how to sort out the range of the newest updates…
c. afterwards, how to pick out the 50 newest

Recommended answer

import requests
from bs4 import BeautifulSoup
from concurrent.futures.thread import ThreadPoolExecutor

url = "https://wordpress.org/plugins/browse/popular/{}"


def main(url, num):
    # collect the plugin links from one page of the "popular" listing
    with requests.Session() as req:
        print(f"Collecting Page# {num}")
        r = req.get(url.format(num))
        soup = BeautifulSoup(r.content, 'html.parser')
        link = [item.get("href")
                for item in soup.findAll("a", rel="bookmark")]
        return set(link)


# page 1 has no "page/x/" suffix; pages 2-49 do
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(main, url, num)
               for num in [""]+[f"page/{x}/" for x in range(2, 50)]]

allin = []
for future in futures:
    allin.extend(future.result())


def parser(url):
    # extract the plugin title and the meta fields from one plugin page
    with requests.Session() as req:
        print(f"Extracting {url}")
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        target = [item.get_text(strip=True, separator=" ") for item in soup.find(
            "h3", class_="screen-reader-text").find_next("ul").findAll("li")[:8]]
        head = [soup.find("h1", class_="plugin-title").text]
        # keep only the fields of interest: Version, Last updated, Active
        # installations, WordPress version, Tested up to, PHP version
        new = [x for x in target if x.startswith(
            ("V", "Las", "Ac", "W", "T", "P"))]
        return head + new


with ThreadPoolExecutor(max_workers=50) as executor1:
    futures1 = [executor1.submit(parser, url) for url in allin]

for future in futures1:
    print(future.result())

Output: view it online.
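
This answer collects the URLs and prints one record per plugin, but stops short of parts b and c of the question. As a follow-up sketch, and assuming each record keeps a "Last updated: ..." string (which the startswith filter in parser does retain), the records could be reduced to the 50 most recently updated plugins by reusing parse_last_updated and heapq from the sketches above:

import heapq
from datetime import datetime

records = [future.result() for future in futures1]

def updated(rec):
    # rec looks like ['WP Job Manager', 'Version: ...', ..., 'Last updated: 19 hours ago']
    for field in rec:
        if field.startswith("Last updated"):
            return parse_last_updated(field)  # helper from the sketch above
    return datetime.min  # records without a timestamp sort last

newest_50 = heapq.nlargest(50, records, key=updated)
for rec in newest_50:
    print(rec)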
