When scraping a table, how can I avoid concatenating data from different tabs in one cell?


Question

I scraped this page, https://www.capfriendly.com/teams/bruins, specifically looking for the tables under the "Cap Hit" tab (Forwards, Defense, Goaltenders).

I used Python with BeautifulSoup4, and CSV as the output format.

import csv
import requests, bs4

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")

with open("csvfile.csv", "w", newline='') as team_data:
    writer = csv.writer(team_data)  # create the writer once, not once per row
    for tr in table('tr', class_=['odd', 'even']):  # all tr whose class is odd or even
        row = [td.text for td in tr('td')]  # extract each td's text
        writer.writerow(row)

This is the output I get:

['Krejci, David "A"', 'NMC', 'C', 'NHL', '30', '$7,250,000$7,250,000NMC', '$7,250,000$7,500,000NMC', '$7,250,000$7,500,000NMC', '$7,250,000$7,000,000Modified NTC', '$7,250,000$7,000,000Modified NTC', 'UFA', '']
['Bergeron, Patrice "A"', 'NMC', 'C', 'NHL', '31', '$6,875,000$8,750,000NMC', '$6,875,000$8,750,000NMC', '$6,875,000$6,875,000$6,000,000NMC', '$6,875,000$4,375,000$3,500,000NMC', '$6,875,000$4,375,000$1,000,000Modified NTC, NMC', '$6,875,000$4,375,000$1,000,000Modified NTC, NMC', 'UFA']
['Backes, David', 'NMC', 'C, RW', 'NHL', '32', '$6,000,000$8,000,000$3,000,000NMC', '$6,000,000$8,000,000$3,000,000NMC', '$6,000,000$6,000,000$3,000,000NMC', '$6,000,000$4,000,000$3,000,000Modified NTC', '$6,000,000$4,000,000$1,000,000Modified NTC', 'UFA', '']
['Marchand, Brad', 'M-NTC', 'LW', 'NHL', '28', '$4,500,000$5,000,000Modified NTC', '$6,125,000$8,000,000$4,000,000NMC', '$6,125,000$8,000,000$3,000,000NMC', '$6,125,000$7,500,000$4,000,000NMC', '$6,125,000$5,000,000$1,000,000NMC', '$6,125,000$6,500,000$4,000,000NMC', '$6,125,000$5,000,000$3,000,000Modified NTC']

As you can see, data from different tabs is concatenated together:

'$7,250,000$7,000,000Modified NTC'

Somebody advised me to use JavaScript to scrape the table. Would that solve my problem?

Recommended answer

Based on the page source, each of these cells contains several pieces of text that are conditionally visible depending on which tab you are on (as your title states). The class .hide is added to a child element of the td when that element is meant to be hidden on the current tab. Because td.text returns the text of every descendant, hidden or not, the values from all tabs end up joined in one string.
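To make the cause concrete, here is a minimal sketch that reproduces the concatenation offline. The markup in SAMPLE_CELL is hypothetical, modeled only on the description above (the real page's class names other than hide, and its nesting, may differ), and html.parser is used just to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical cell markup: one visible span and two spans hidden via "hide".
# The class names "cap", "salary" and "clause" are invented for illustration.
SAMPLE_CELL = (
    '<table><tr><td>'
    '<span class="cap">$7,250,000</span>'
    '<span class="salary hide">$7,000,000</span>'
    '<span class="clause hide">Modified NTC</span>'
    '</td></tr></table>'
)

td = BeautifulSoup(SAMPLE_CELL, "html.parser").find("td")

# .text joins the text of every descendant, whether or not CSS would hide it.
all_text = td.text

# Keeping only direct child spans without the "hide" class gives the visible text.
visible_text = "".join(
    span.text
    for span in td.find_all("span", recursive=False)
    if "hide" not in span.get("class", [])
)

print(repr(all_text))
print(repr(visible_text))
```

BeautifulSoup does not apply CSS, so it cannot know on its own which elements a browser would hide; the filter has to be written explicitly.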

When you parse the td elements to retrieve the text, you can filter out the elements that are supposed to be hidden. That way you retrieve only the text that would be visible if you were viewing the page in a web browser.

In the snippet below, I added a parse_td function that filters out the child span elements with a class of hide and returns the text of the first remaining span. If a cell has no visible child span, it falls back to the cell's full text.

import csv
import requests, bs4

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")

def parse_td(td):
    # Keep only direct child spans that do not carry the "hide" class.
    # .get('class', []) avoids a KeyError for spans without a class attribute.
    filtered_data = [tag.text for tag in td.find_all('span', recursive=False)
                     if 'hide' not in tag.get('class', [])]
    # Fall back to the whole cell's text when there is no visible child span.
    return filtered_data[0] if filtered_data else td.text

with open("csvfile.csv", "w", newline='') as team_data:
    writer = csv.writer(team_data)  # create the writer once, not once per row
    for tr in table('tr', class_=['odd', 'even']):
        row = [parse_td(td) for td in tr('td')]
        writer.writerow(row)
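As a variation on the same idea (not the approach used above), one could remove every element carrying the hide class from the parsed tree before reading any text, which would also cover hidden elements that are not direct children of the td. The sketch below runs against a small hypothetical table rather than the live page; SAMPLE_TABLE and visible_rows are names invented for this illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical table modeled on the question's output; the real markup may differ.
SAMPLE_TABLE = """
<table id="team">
  <tr class="odd">
    <td>Krejci, David</td>
    <td><span>$7,250,000</span><span class="hide">$7,000,000</span><span class="hide">Modified NTC</span></td>
  </tr>
  <tr class="even">
    <td>Backes, David</td>
    <td><span>$6,000,000</span><span class="hide">$4,000,000</span></td>
  </tr>
</table>
"""


def visible_rows(html):
    """Return the table rows as lists of cell text, ignoring hidden elements."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find(id="team")
    # Drop every element marked "hide" so it no longer contributes any text.
    for hidden in table.find_all(class_="hide"):
        hidden.decompose()
    return [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr", class_=["odd", "even"])
    ]


rows = visible_rows(SAMPLE_TABLE)

# Write to an in-memory buffer here; a real script would open a CSV file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

The trade-off is that decompose() modifies the parsed tree, so this only suits cases where the hidden values are not needed later in the same script.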
