This article looks at Scrapy: waiting to finish parsing a particular URL before parsing other URLs. The question and answer below should be a useful reference for anyone facing the same problem.

Problem Description


Brief Explanation:

I have a Scrapy project that takes stock data from Yahoo! Finance. In order for my project to work, I need to ensure that a stock has been around for a desired amount of time. I do this by scraping CAT (Caterpillar Inc. (CAT) - NYSE) first, getting the number of closing prices there are for that time period, and then ensuring that all stocks scraped after that have the same number of closing prices as CAT, thus ensuring that each stock has been publicly traded for the desired length of time.

The Problem:

This all works fine and dandy; however, my problem is that before Scrapy has finished parsing CAT, it begins scraping other stocks and parsing them. This results in an error, because before I can get the desired number of closing prices from CAT, Scrapy is already trying to decide whether other stocks have the same number of closing prices as CAT, a number which does not exist yet.

The actual question

How can I force Scrapy to finish parsing one url before beginning any others?

I have also tried:

def start_requests(self):
    global start_time
    yield Request('http://finance.yahoo.com/q?s=CAT', self.parse)
    # Waits 4 seconds to allow CAT to finish crawling
    if time.time() - start_time > 0.2:
        for i in self.other_urls:
            yield Request(i, self.parse)

but the stocks in other_urls never commence, because scrapy never goes back to def start_requests to check if the time is above 0.2

The Entire Code:

from scrapy.selector import Selector
from scrapy import Request
from scrapy.exceptions import CloseSpider
from sharpeparser.gen_settings import *
from decimal import Decimal
from scrapy.spider import Spider
from sharpeparser.items import SharpeparserItem
import numpy
import time

if data_intervals == "m":
    required_amount_of_returns = 24
elif data_intervals == "w":
    required_amount_of_returns = 100
else:
    required_amount_of_returns = 0  # default; finalize_stock sets the real value from CAT's data

counter = 1
start_time = time.time()


class DnotSpider(Spider):

# ---- >>> ENSURE YOU INDENT 1 ---- >>>
# =======================================
name = "dnot"
allowed_domains = ["finance.yahoo.com", "http://eoddata.com/", "ca.finance.yahoo.com"]
start_urls = ['http://finance.yahoo.com/q?s=CAT']
other_urls = ['http://eoddata.com/stocklist/TSX.htm', 'http://eoddata.com/stocklist/TSX/B.htm', 'http://eoddata.com/stocklist/TSX/C.htm', 'http://eoddata.com/stocklist/TSX/D.htm', 'http://eoddata.com/stocklist/TSX/E.htm', 'http://eoddata.com/stocklist/TSX/F.htm', 'http://eoddata.com/stocklist/TSX/G.htm', 'http://eoddata.com/stocklist/TSX/H.htm', 'http://eoddata.com/stocklist/TSX/I.htm', 'http://eoddata.com/stocklist/TSX/J.htm', 'http://eoddata.com/stocklist/TSX/K.htm', 'http://eoddata.com/stocklist/TSX/L.htm', 'http://eoddata.com/stocklist/TSX/M.htm', 'http://eoddata.com/stocklist/TSX/N.htm', 'http://eoddata.com/stocklist/TSX/O.htm', 'http://eoddata.com/stocklist/TSX/P.htm', 'http://eoddata.com/stocklist/TSX/Q.htm', 'http://eoddata.com/stocklist/TSX/R.htm', 'http://eoddata.com/stocklist/TSX/S.htm', 'http://eoddata.com/stocklist/TSX/T.htm', 'http://eoddata.com/stocklist/TSX/U.htm', 'http://eoddata.com/stocklist/TSX/V.htm', 'http://eoddata.com/stocklist/TSX/W.htm', 'http://eoddata.com/stocklist/TSX/X.htm', 'http://eoddata.com/stocklist/TSX/Y.htm', 'http://eoddata.com/stocklist/TSX/Z.htm',
    'http://eoddata.com/stocklist/NASDAQ/B.htm', 'http://eoddata.com/stocklist/NASDAQ/C.htm', 'http://eoddata.com/stocklist/NASDAQ/D.htm', 'http://eoddata.com/stocklist/NASDAQ/E.htm', 'http://eoddata.com/stocklist/NASDAQ/F.htm', 'http://eoddata.com/stocklist/NASDAQ/G.htm', 'http://eoddata.com/stocklist/NASDAQ/H.htm', 'http://eoddata.com/stocklist/NASDAQ/I.htm', 'http://eoddata.com/stocklist/NASDAQ/J.htm', 'http://eoddata.com/stocklist/NASDAQ/K.htm', 'http://eoddata.com/stocklist/NASDAQ/L.htm', 'http://eoddata.com/stocklist/NASDAQ/M.htm', 'http://eoddata.com/stocklist/NASDAQ/N.htm', 'http://eoddata.com/stocklist/NASDAQ/O.htm', 'http://eoddata.com/stocklist/NASDAQ/P.htm', 'http://eoddata.com/stocklist/NASDAQ/Q.htm', 'http://eoddata.com/stocklist/NASDAQ/R.htm', 'http://eoddata.com/stocklist/NASDAQ/S.htm', 'http://eoddata.com/stocklist/NASDAQ/T.htm', 'http://eoddata.com/stocklist/NASDAQ/U.htm', 'http://eoddata.com/stocklist/NASDAQ/V.htm', 'http://eoddata.com/stocklist/NASDAQ/W.htm', 'http://eoddata.com/stocklist/NASDAQ/X.htm', 'http://eoddata.com/stocklist/NASDAQ/Y.htm', 'http://eoddata.com/stocklist/NASDAQ/Z.htm',
    'http://eoddata.com/stocklist/NYSE/B.htm', 'http://eoddata.com/stocklist/NYSE/C.htm', 'http://eoddata.com/stocklist/NYSE/D.htm', 'http://eoddata.com/stocklist/NYSE/E.htm', 'http://eoddata.com/stocklist/NYSE/F.htm', 'http://eoddata.com/stocklist/NYSE/G.htm', 'http://eoddata.com/stocklist/NYSE/H.htm', 'http://eoddata.com/stocklist/NYSE/I.htm', 'http://eoddata.com/stocklist/NYSE/J.htm', 'http://eoddata.com/stocklist/NYSE/K.htm', 'http://eoddata.com/stocklist/NYSE/L.htm', 'http://eoddata.com/stocklist/NYSE/M.htm', 'http://eoddata.com/stocklist/NYSE/N.htm', 'http://eoddata.com/stocklist/NYSE/O.htm', 'http://eoddata.com/stocklist/NYSE/P.htm', 'http://eoddata.com/stocklist/NYSE/Q.htm', 'http://eoddata.com/stocklist/NYSE/R.htm', 'http://eoddata.com/stocklist/NYSE/S.htm', 'http://eoddata.com/stocklist/NYSE/T.htm', 'http://eoddata.com/stocklist/NYSE/U.htm', 'http://eoddata.com/stocklist/NYSE/V.htm', 'http://eoddata.com/stocklist/NYSE/W.htm', 'http://eoddata.com/stocklist/NYSE/X.htm', 'http://eoddata.com/stocklist/NYSE/Y.htm', 'http://eoddata.com/stocklist/NYSE/Z.htm',
    'http://eoddata.com/stocklist/HKEX/0.htm', 'http://eoddata.com/stocklist/HKEX/1.htm', 'http://eoddata.com/stocklist/HKEX/2.htm', 'http://eoddata.com/stocklist/HKEX/3.htm', 'http://eoddata.com/stocklist/HKEX/6.htm', 'http://eoddata.com/stocklist/HKEX/8.htm',
    'http://eoddata.com/stocklist/LSE/0.htm', 'http://eoddata.com/stocklist/LSE/1.htm', 'http://eoddata.com/stocklist/LSE/2.htm', 'http://eoddata.com/stocklist/LSE/3.htm', 'http://eoddata.com/stocklist/LSE/4.htm', 'http://eoddata.com/stocklist/LSE/5.htm', 'http://eoddata.com/stocklist/LSE/6.htm', 'http://eoddata.com/stocklist/LSE/7.htm', 'http://eoddata.com/stocklist/LSE/8.htm', 'http://eoddata.com/stocklist/LSE/9.htm', 'http://eoddata.com/stocklist/LSE/A.htm', 'http://eoddata.com/stocklist/LSE/B.htm', 'http://eoddata.com/stocklist/LSE/C.htm', 'http://eoddata.com/stocklist/LSE/D.htm', 'http://eoddata.com/stocklist/LSE/E.htm', 'http://eoddata.com/stocklist/LSE/F.htm', 'http://eoddata.com/stocklist/LSE/G.htm', 'http://eoddata.com/stocklist/LSE/H.htm', 'http://eoddata.com/stocklist/LSE/I.htm', 'http://eoddata.com/stocklist/LSE/J.htm', 'http://eoddata.com/stocklist/LSE/K.htm', 'http://eoddata.com/stocklist/LSE/L.htm', 'http://eoddata.com/stocklist/LSE/M.htm', 'http://eoddata.com/stocklist/LSE/N.htm', 'http://eoddata.com/stocklist/LSE/O.htm', 'http://eoddata.com/stocklist/LSE/P.htm', 'http://eoddata.com/stocklist/LSE/Q.htm', 'http://eoddata.com/stocklist/LSE/R.htm', 'http://eoddata.com/stocklist/LSE/S.htm', 'http://eoddata.com/stocklist/LSE/T.htm', 'http://eoddata.com/stocklist/LSE/U.htm', 'http://eoddata.com/stocklist/LSE/V.htm', 'http://eoddata.com/stocklist/LSE/W.htm', 'http://eoddata.com/stocklist/LSE/X.htm', 'http://eoddata.com/stocklist/LSE/Y.htm', 'http://eoddata.com/stocklist/LSE/Z.htm',
    'http://eoddata.com/stocklist/AMS/A.htm', 'http://eoddata.com/stocklist/AMS/B.htm', 'http://eoddata.com/stocklist/AMS/C.htm', 'http://eoddata.com/stocklist/AMS/D.htm', 'http://eoddata.com/stocklist/AMS/E.htm', 'http://eoddata.com/stocklist/AMS/F.htm', 'http://eoddata.com/stocklist/AMS/G.htm', 'http://eoddata.com/stocklist/AMS/H.htm', 'http://eoddata.com/stocklist/AMS/I.htm', 'http://eoddata.com/stocklist/AMS/J.htm', 'http://eoddata.com/stocklist/AMS/K.htm', 'http://eoddata.com/stocklist/AMS/L.htm', 'http://eoddata.com/stocklist/AMS/M.htm', 'http://eoddata.com/stocklist/AMS/N.htm', 'http://eoddata.com/stocklist/AMS/O.htm', 'http://eoddata.com/stocklist/AMS/P.htm', 'http://eoddata.com/stocklist/AMS/Q.htm', 'http://eoddata.com/stocklist/AMS/R.htm', 'http://eoddata.com/stocklist/AMS/S.htm', 'http://eoddata.com/stocklist/AMS/T.htm', 'http://eoddata.com/stocklist/AMS/U.htm', 'http://eoddata.com/stocklist/AMS/V.htm', 'http://eoddata.com/stocklist/AMS/W.htm', 'http://eoddata.com/stocklist/AMS/X.htm', 'http://eoddata.com/stocklist/AMS/Y.htm', 'http://eoddata.com/stocklist/AMS/Z.htm',
    'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=A', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=B', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=C', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=D', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=E', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=F', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=G', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=H', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=I', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=J', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=K', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=L', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=M', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=N', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=O', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=P', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=Q', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=R', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=S', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=T', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=U', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=V', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=W', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=X', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=Y', 'https://ca.finance.yahoo.com/q/cp?s=%5EIXIC&alpha=Z',
    'https://ca.finance.yahoo.com/q/cp?s=%5EHSI&alpha=0', 'https://ca.finance.yahoo.com/q/cp?s=%5EHSI&alpha=1', 'https://ca.finance.yahoo.com/q/cp?s=%5EHSI&alpha=2', 'https://ca.finance.yahoo.com/q/cp?s=%5EHSI&alpha=3',
    'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=A', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=B', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=C', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=D', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=E', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=F', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=G', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=H', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=I', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=J', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=K', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=L', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=M', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=N', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=O', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=P', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=Q', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=R', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=S', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=T', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=U', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=V', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=W', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=X', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=Y', 'http://finance.yahoo.com/q/cp?s=%5EN100&alpha=Z',
    'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=A', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=B', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=C', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=D', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=E', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=F', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=G', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=H', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=I', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=J', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=K', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=L', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=M', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=N', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=O', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=P', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=Q', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=R', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=S', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=T', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=U', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=V', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=W', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=X', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=Y', 'http://finance.yahoo.com/q/cp?s=%5EFCHI&alpha=Z',
    'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=A', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=B', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=C', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=D', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=E', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=F', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=G', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=H', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=I', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=J', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=K', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=L', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=M', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=N', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=O', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=P', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=Q', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=R', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=S', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=T', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=U', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=V', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=W', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=X', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=Y', 'http://finance.yahoo.com/q/cp?s=%5EAEX&alpha=Z']

def start_requests(self):
    global start_time
    yield Request('http://finance.yahoo.com/q?s=CAT', self.parse)
    # Waits 4 seconds to allow CAT to finish crawling
    if time.time() - start_time > 0.2:
        for i in self.other_urls:
            yield Request(i, self.parse)

def parse(self, response):

    if "eoddata" in response.url:
        companyList = response.xpath('//tr[@class="ro"]/td/a/text()').extract()
        for company in companyList:
            if "TSX" in response.url:
                go = 'http://finance.yahoo.com/q/hp?s={0}.TO&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                yield Request(go, self.stocks1)
            elif "LSE" in response.url:
                go = 'http://finance.yahoo.com/q/hp?s={0}.L&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                yield Request(go, self.stocks1)
            elif "HKEX" in response.url:
                go = 'http://finance.yahoo.com/q/hp?s={0}.HK&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                yield Request(go, self.stocks1)
            elif "AMS" in response.url:
                go = 'https://ca.finance.yahoo.com/q/hp?s={0}.AS&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                yield Request(go, self.stocks1)
            else:
                go = 'https://ca.finance.yahoo.com/q/hp?s={0}&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
                yield Request(go, self.stocks1)
    elif "http://finance.yahoo.com/q?s=CAT" in response.url:
        go = 'http://finance.yahoo.com/q/hp?s=CAT&a={0}&b={1}&c={2}&d={3}&e={4}&f={5}&g={6}'.format(beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)
        yield Request(go, self.stocks1)
    else:
        rows = response.xpath('//table[@class="yfnc_tableout1"]//table/tr')[1:]
        for row in rows:
            company = row.xpath('.//td[1]/b/a/text()').extract()
            go = 'http://finance.yahoo.com/q/hp?s={0}&a={1}&b={2}&c={3}&d={4}&e={5}&f={6}&g={7}'.format(company, beginning_month, beginning_day, beginning_year, ending_month, ending_day, ending_year, data_intervals)  # a=month, b=day, matching the branches above
            yield Request(go, self.stocks1)

def stocks1(self, response):

    current_page = response.url
    print current_page
    # If the link is not the same as the first page, ie. stocks1 is requested through stocks2, get the stock data from stocks2
    if initial_ending not in current_page[-iel:]:
        returns_pages = response.meta.get('returns_pages')
        # Remove the last stock price from the stock list, because it is the same as the first on the new list
        if returns_pages:
            if len(returns_pages) > 2:
                returns_pages = returns_pages[:-1]
    else:
        # Else, if the link does match that of the first page, create a new list because one does not exist yet
        returns_pages = []

    # This grabs the stock data from the page
    rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
    print "stocks1"
    print returns_pages
    for row in rows:
        cells = row.xpath('.//td/text()').extract()
        try:
            values = cells[-1]
            try:
                float(values)
                # And adds it to returns_pages
                returns_pages.append(values)
            except ValueError:
                continue
        except IndexError:
            continue
    print "after"
    print returns_pages

    # exp determines if there is a 'Next page' or not
    exp = response.xpath('//td[@align="right"]/a[@rel="next"]').extract()
    # If there is a 'Next Page':
    if exp:
        # And this is the first page:
        if initial_ending in current_page[-iel:]:
            # create the necessary url for the 2nd page
            next_page = current_page + "&z=66&y=66"
        # If this is not the first page
        else:
            # This increases the end of the link by 66, thereby getting the next 66 results for pages 2 and after
            u = int(current_page[-6:].split("=",1)[1])
            o = len(str(u))
            u += 66
            next_page = current_page[:-o] + str(u)
            print next_page, "66&y in curr_page"
        # Then go back to self.stocks1 to get more data on the next page
        yield Request(next_page, self.stocks2, meta={'returns_pages': returns_pages}, dont_filter=True)
    # Else, if there is no 'Next Link'
    else:
        # Send the returns to finalize_stock to be saved in the item
        yield Request(current_page, callback=self.finalize_stock, meta={'returns_pages': returns_pages}, dont_filter=True)

def stocks2(self, response):

    # Prints the link of the current url
    current_page = response.url
    print current_page

    # Gets the returns from the previous page
    returns_pages = response.meta.get('returns_pages')
    # Removes the last return from the previous page because it will be a duplicate
    returns_pages = returns_pages[:-1]
    print "stocks2"
    print returns_pages
    # Gets all of the returns on the page
    rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
    for row in rows:
        cells = row.xpath('.//td/text()').extract()
        try:
            values = cells[-1]
            try:
                float(values)
                # And adds it to the previous returns
                returns_pages.append(values)
            except ValueError:
                continue
        except IndexError:
            continue

    print "after 2"
    print returns_pages

    # exp determines if there is a 'Next page' or not
    exp = response.xpath('//td[@align="right"]/a[@rel="next"]').extract()
    # If there is a 'Next Page':
    if exp:
        # And somehow, this is the first page (should never be true)
        if initial_ending in current_page[-iel:]:
            # Add necessary link to go to the second page
            next_page = current_page + "&z=66&y=66"
            print next_page, "66&y not in curr_page"
        # Else, this is not the first page (should always be true)
        else:
            # add 66 to the last number on the preceding link in order to access the second or later pages
            u = int(current_page[-6:].split("=",1)[1])
            o = len(str(u))
            u += 66
            next_page = current_page[:-o] + str(u)
            print next_page, "66&y in curr_page"
        # go back to self.stocks1 to get more data on the next page
        yield Request(next_page, self.stocks1, meta={'returns_pages': returns_pages}, dont_filter=True)
    else:
        # If there is no "Next" link, send the retuns to finalize.stock to be saved in the item
        yield Request(current_page, callback=self.finalize_stock, meta={'returns_pages': returns_pages}, dont_filter=True)
        print "sending to finalize stock"

def finalize_stock(self,response):

    current_page = response.url
    print "====================="
    print "finalize_stock called"
    print current_page
    print "====================="
    unformatted_returns = response.meta.get('returns_pages')
    returns = [float(i) for i in unformatted_returns]
    global required_amount_of_returns, counter
    if counter == 1 and "CAT" in response.url:
        required_amount_of_returns = len(returns)
    elif required_amount_of_returns == 0:
        raise CloseSpider("Error with initiating required amount of returns")

    counter += 1
    print counter

    # Iterator to calculate Rate of return
    # ====================================
    if data_intervals == "m":
        k = 12
    elif data_intervals == "w":
        k = 4
    else:
        k = 30

    sub_returns_amount = required_amount_of_returns - k
    sub_returns = returns[:sub_returns_amount]
    rate_of_return = []
    RFR = 0.03

    # Make sure list is exact length, otherwise rate_of_return will be inaccurate
    # Returns has not been checked by pipeline yet, so small lists will be in the variable

    if len(returns) > required_amount_of_returns:
        for number in sub_returns:
            numerator = number - returns[k]
            rate = numerator/returns[k]
            if rate == '':
                rate = 0
            rate_of_return.append(rate)
            k += 1

    item = SharpeparserItem()
    items = []
    item['url'] = response.url
    item['name'] = response.xpath('//div[@class="title"]/h2/text()').extract()
    item['avg_returns'] = numpy.average(rate_of_return)
    item['var_returns'] = numpy.cov(rate_of_return)
    item['sd_returns'] = numpy.std(rate_of_return)
    item['returns'] = unformatted_returns
    item['rate_of_returns'] = rate_of_return
    item['exchange'] = response.xpath('//span[@class="rtq_exch"]/text()').extract()
    item['ind_sharpe'] = ((numpy.average(rate_of_return) - RFR) / numpy.std(rate_of_return))
    items.append(item)
    yield item
Solution

Actual Question

As for the actual problem of doing each request in sequence... there are a few existing Stack Overflow questions similar to yours.

As a general summary, there seem to be a couple of options (a minimal sketch combining the first two follows this list):

  1. Utilise the priority flag in a start_requests() function to iterate through websites in a particular order
  2. Set CONCURRENT_REQUESTS=1 to ensure that only one request is carried out at a time
  3. If you want to parse all sites at once after the first CAT ticker has been done, it might be possible to flick the above setting back up to a higher value via the settings API once the first site has been parsed
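
A minimal sketch combining options 1 and 2. OrderedSpider, the single trimmed url and the logging call are illustrative, not from the original project; custom_settings needs a reasonably recent Scrapy, otherwise put CONCURRENT_REQUESTS = 1 in settings.py:

    from scrapy import Spider, Request

    class OrderedSpider(Spider):
        name = "ordered"
        # Only one request in flight at a time, so CAT's response is
        # downloaded and parsed before the next request is dispatched.
        custom_settings = {'CONCURRENT_REQUESTS': 1}
        other_urls = ['http://eoddata.com/stocklist/TSX.htm']  # trimmed

        def start_requests(self):
            # Requests with a higher priority leave the scheduler first,
            # so CAT is always fetched ahead of the rest.
            yield Request('http://finance.yahoo.com/q?s=CAT',
                          callback=self.parse, priority=100)
            for url in self.other_urls:
                yield Request(url, callback=self.parse)

        def parse(self, response):
            self.logger.info("parsed %s", response.url)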

General Coding

I can't run your exact code because you are missing the class structure, but I can already see a few things that might be tripping you up:

  1. This SO post describes yield. To better understand how your yield function is working, run the following:

    def it():
        yield range(2)
        yield range(10)
    
    g = it()
    for i in g:
        print i
    # now the generator has been consumed.
    for i in g:
        print i
    
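     Under Python 2 the first loop prints [0, 1] and then [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]; the second loop prints nothing, because the generator has already been consumed. start_requests() is consumed exactly once in the same way, which is why execution never "goes back" into it.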

  2. This SO post also demonstrates that the start_requests() function overrides the list specified by start_urls. It appears that for this reason your urls in start_urls are ignored, and the function effectively only ever yields Request('http://finance.yahoo.com/q?s=CAT', self.parse).
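
     A minimal sketch of the usual workaround, assuming the spider above: yield only the CAT request from start_requests(), and release other_urls from the callback that finishes CAT's processing (finalize_stock() here), so nothing else is even scheduled until the CAT baseline exists:

    def start_requests(self):
        # Schedule only CAT up front; everything else waits.
        yield Request('http://finance.yahoo.com/q?s=CAT', self.parse)

    def finalize_stock(self, response):
        # ... existing processing that sets required_amount_of_returns ...
        if "CAT" in response.url:
            # CAT's closing-price count now exists, so release the rest.
            for url in self.other_urls:
                yield Request(url, self.parse)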

  3. Is there any particular reason that you are not listing all the urls in the start_urls list in the order you want them parsed, and deleting the function start_requests()? The docs on start_urls state:

        subsequent URLs will be generated successively from data contained in the start URLs
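
     If ordering only the start of each request is enough, that looks like the sketch below (with concurrency above 1 this orders when requests begin, not when they finish):

    class DnotSpider(Spider):
        name = "dnot"
        # CAT first, then everything else, in the order listed; no
        # start_requests() override needed.
        start_urls = ['http://finance.yahoo.com/q?s=CAT',
                      'http://eoddata.com/stocklist/TSX.htm',
                      # ... the remaining urls, in the desired order ...
                      ]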

  4. Sticking things in globals will tend to cause you problems in projects like this; it's usually better to initialise them as attributes of self in a def __init__(self): method, which is called when the class is instantiated.
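
     For example, a sketch mirroring the module-level globals above:

    class DnotSpider(Spider):
        name = "dnot"

        def __init__(self, *args, **kwargs):
            super(DnotSpider, self).__init__(*args, **kwargs)
            # State that used to be global now lives on the instance.
            self.counter = 1
            self.start_time = time.time()
            self.required_amount_of_returns = 0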

  5. This might be petty, but you could save yourself a lot of scrolling / effort by listing all the symbols in a separate file and then just loading them up in your code. As it stands you have a lot of repetition in that list that you could cut out, making it far easier to read.
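
     For instance (symbols.txt is a hypothetical file with one url per line, and the generated eoddata list is illustrative, since the original list skips some letter pages):

    # Load urls from a separate file instead of a huge inline literal.
    with open('symbols.txt') as f:
        other_urls = [line.strip() for line in f if line.strip()]

    # Or generate the repetitive eoddata pages programmatically:
    import string
    exchanges = {'TSX': string.ascii_uppercase, 'NASDAQ': string.ascii_uppercase,
                 'NYSE': string.ascii_uppercase, 'HKEX': '012368'}
    eoddata_urls = ['http://eoddata.com/stocklist/{0}/{1}.htm'.format(ex, page)
                    for ex, pages in sorted(exchanges.items()) for page in pages]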

This concludes this article on Scrapy: waiting to finish parsing a particular url before parsing other urls. We hope the answer above is a helpful reference.
