My goal is to extract company names from a CSV file and scrape each company's founding year and the country/region it is based in. For example, for the following company I want to get back "1989" and "Ireland":

http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=allen%20mcguire%20partners

I have been at this for a while, using SO posts to guide me, but I can't seem to get it over the line. Below is the main file, which works fine apart from the odd fact that my header does not seem to be recognised, so I have to use the header's first letter to get the first (and only) column; that is fine by me, though. My problem is that my web scraping file (printed below the main function here) cannot find, and subsequently return, the values I want.

from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2
import business_week_test


input_csv = "sample.csv"
output_csv = "BUSINESS_WEEK.csv"

def main():
    with open(input_csv, "rb") as infile:

        input_fields = ("COMPANY_NAME")
        reader = csv.DictReader(infile, fieldnames = input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("COMPANY_NAME","LOCATION", "YEAR_FOUNDED")
            writer = csv.DictWriter(outfile, fieldnames = output_fields)
            writer.writerow(dict((h,h) for h in output_fields))
            next(reader)
            first_row = next(reader)
            for next_row in reader:
                search_term = first_row["C"]
                num_words_in_comp_name = first_row["C"].split()
                num_words_in_comp_name = len(num_words_in_comp_name)
                result = business_week_test.bwt(search_term, num_words_in_comp_name)
                first_row["LOCATION"] = result
                writer.writerow(first_row)
                first_row = next_row

if __name__ == "__main__":
    main()
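
An aside on the header issue mentioned in the question: ("COMPANY_NAME") is just a parenthesised string, not a one-element tuple, so csv.DictReader iterates over its characters and uses each letter as a field name, which is why the first column ends up keyed by "C". A minimal sketch of the fix, with the trailing comma being the only change:

input_fields = ("COMPANY_NAME",)  # the trailing comma makes this a tuple rather than a string
reader = csv.DictReader(infile, fieldnames=input_fields)
# the column can then be read as row["COMPANY_NAME"] instead of row["C"]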


Here is the web scraping file:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup


def bwt(article, length):
    art2 = article.split()
    #print(art2)
    article1 = urllib.quote(article)
    #print(article1)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Google Chrome')]

    if (length == 1):
        link = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=" + art2[0]
    elif (length == 2):
        link = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=" + art2[0] + "%20" + art2[1]
    elif (length == 3):
        #print(art2[0], art2[1],art2[2])
        link = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=" + art2[0] + "%20" + art2[1] + "%20" + art2[2]
    #print(link)

    try:
        opener.open(link)
        #print("here")
    except urllib2.HTTPError, err:
        if err.code == 404 or err.code == 400:
            #print("here", link)
            return "NA"
        else:
            raise

    resource = opener.open(link)
    #print(resource)

    data = resource.read()
    resource.close()
    soup = BeautifulSoup(data)
    #print(soup)
    return soup.find('div',id="bodyContent").p
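
A note on why the last line of bwt can fail: BeautifulSoup's find returns None when no matching tag exists, and accessing .p on None raises AttributeError, so if the Businessweek snapshot page carries no div with id="bodyContent" the function never gets a value back (the answer below reads the page's itemprop attributes instead). A small, hedged sketch of a defensive lookup, assuming soup is built exactly as above:

def first_paragraph_or_na(soup):
    # find() returns None when the tag is absent; guard before touching .p
    container = soup.find('div', id="bodyContent")
    if container is None or container.p is None:
        return "NA"  # mirrors the "NA" returned for 404/400 above
    return container.p.text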

Best Answer

Here is sample code that gets the location and founding year for the company "A&P Group Limited":

import urllib2
from BeautifulSoup import BeautifulSoup

LINK = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=1716794"
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Google Chrome')]

soup = BeautifulSoup(opener.open(LINK))

# both fields are marked up with schema.org microdata (itemprop attributes);
# the last <p> of the address block holds the country shown in the output below
location = soup.find('div', {'itemprop': 'address'}).findAll('p')[-1].text
founded = soup.find('span', {'itemprop': "foundingDate"}).text

print location, founded


This prints:

United Kingdom 1971


Hope that helps.
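
For completeness, here is a rough sketch of how the two selectors from this answer could be folded back into the question's bwt() so that main() can write both LOCATION and YEAR_FOUNDED. The privcapId-by-name URL scheme is taken from the question; whether it resolves for every company name is an assumption:

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

BASE = "http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId="

def bwt(article):
    # urllib.quote percent-encodes the spaces, so the word-count branches
    # from the question are no longer needed
    link = BASE + urllib.quote(article)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Google Chrome')]
    try:
        page = opener.open(link)
    except urllib2.HTTPError, err:
        if err.code in (400, 404):
            return ("NA", "NA")
        raise
    soup = BeautifulSoup(page)
    # selectors taken from the answer above (schema.org microdata attributes)
    location, year = "NA", "NA"
    address = soup.find('div', {'itemprop': 'address'})
    if address is not None:
        location = address.findAll('p')[-1].text
    founded = soup.find('span', {'itemprop': 'foundingDate'})
    if founded is not None:
        year = founded.text
    return (location, year)

main() would then unpack the pair, e.g. first_row["LOCATION"], first_row["YEAR_FOUNDED"] = bwt(first_row["COMPANY_NAME"]) (assuming the fieldnames fix noted above).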

Regarding "python - Python web scraping: using BeautifulSoup on Businessweek to find a company's founding year and location", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22303041/
