Web scraping blockchain data that appears to be embedded in JavaScript, using Python: is this even the right approach?

Question

I'm referencing this url: https://tracker.icon.foundation/block/29562412

If you scroll down to "Transactions", it shows 2 transactions with separate links; those links are what I'm trying to grab. If I try a simple pd.read_csv(url) command, it clearly omits the data I'm looking for, so I thought it might be JavaScript-based and tried the following code instead:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://tracker.icon.foundation/block/29562412')
r.html.links
r.html.absolute_links

and I get the result "set()", even though I was expecting the following:

['https://tracker.icon.foundation/transaction/0x9e5927c83efaa654008667d15b0a223f806c25d4c31688c5fdf34936a075d632', 'https://tracker.icon.foundation/transaction/0xd64f88fe865e756ac805ca87129bc287e450bb156af4a256fa54426b0e0e6a3e']

Is JavaScript even the right approach? I tried BeautifulSoup instead and had no luck there either.

Solution

You're right. This page is populated asynchronously using JavaScript, so BeautifulSoup and similar tools, which only parse the HTML the server sends and do not execute its scripts, won't be able to see the specific content you're trying to scrape.
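To illustrate why the link set comes back empty, here is a minimal offline sketch. It assumes the server returns only an empty mount point plus a script tag; STATIC_SHELL and LinkCollector are hypothetical stand-ins, and the real page's markup will differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in the markup it is fed."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


# Hypothetical stand-in for the HTML the server returns: an empty mount
# point plus a script tag. The real page's markup will differ.
STATIC_SHELL = """
<html>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.js"></script>
  </body>
</html>
"""

collector = LinkCollector()
collector.feed(STATIC_SHELL)
print(collector.links)  # set()
```

The transaction links only exist after the browser runs the script and fills in the page, so a parser that sees just the static markup has nothing to collect.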

However, if you log your browser's network traffic, you can see some (XHR) HTTP GET requests being made to a REST API, which serves its results in JSON. This JSON happens to contain the information you're looking for. It actually makes several such requests to various API endpoints, but the one we're interested in is called txList (short for "transaction list" I'm guessing):

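import sys

import requests


def main():
    # Endpoint the block page requests via XHR to fetch its transactions.
    url = "https://tracker.icon.foundation/v3/block/txList"

    # "height" is the block number from the page URL; "page" and "count"
    # presumably control pagination of the result list.
    params = {
        "height": "29562412",
        "page": "1",
        "count": "10"
    }

    response = requests.get(url, params=params)
    # Raise an exception on a 4xx/5xx status instead of parsing an error body.
    response.raise_for_status()

    base_url = "https://tracker.icon.foundation/transaction/"

    # Each entry in "data" describes one transaction in the block.
    for transaction in response.json()["data"]:
        print(base_url + transaction["txHash"])

    return 0


if __name__ == "__main__":
    sys.exit(main())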

Output:

https://tracker.icon.foundation/transaction/0x9e5927c83efaa654008667d15b0a223f806c25d4c31688c5fdf34936a075d632
https://tracker.icon.foundation/transaction/0xd64f88fe865e756ac805ca87129bc287e450bb156af4a256fa54426b0e0e6a3e
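The URL-building step can also be separated from the network call so that it can be checked offline. The following is a minimal sketch, assuming the response has the shape the code above relies on (a top-level "data" list whose entries carry a "txHash" field). The function name transaction_urls and the hand-written sample payload are illustrative and not part of the tracker API.

```python
BASE_URL = "https://tracker.icon.foundation/transaction/"


def transaction_urls(payload):
    """Return one transaction URL per entry in payload["data"].

    A missing or empty "data" field yields an empty list.
    """
    return [BASE_URL + tx["txHash"] for tx in payload.get("data") or []]


# Hand-written sample in the assumed response shape; a real response
# would likely carry additional fields per transaction.
sample = {
    "data": [
        {"txHash": "0x9e5927c83efaa654008667d15b0a223f806c25d4c31688c5fdf34936a075d632"},
        {"txHash": "0xd64f88fe865e756ac805ca87129bc287e450bb156af4a256fa54426b0e0e6a3e"},
    ]
}

for link in transaction_urls(sample):
    print(link)
```

With a live response, the same function could be called as transaction_urls(response.json()), which returns the list the question was originally hoping to get from r.html.absolute_links.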


07-15 19:08