Crawling URLs in Order

2022-10-13 Python Development, 跟版网

Scrapy Crawl URLs in Order

Problem description

So, my problem is relatively simple. I have one spider crawling multiple sites, and I need it to return the data in the order I write it in my code. The code is posted below.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from mlbodds.items import MlboddsItem

class MLBoddsSpider(BaseSpider):
   name = "sbrforum.com"
   allowed_domains = ["sbrforum.com"]
   start_urls = [
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110328/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110329/",
       "http://www.sbrforum.com/mlb-baseball/odds-scores/20110330/"
   ]

   def parse(self, response):
       hxs = HtmlXPathSelector(response)
       sites = hxs.select('//div[@id="col_3"]//div[@id="module3_1"]//div[@id="moduleData4952"]')
       items = []
       for site in sites:
           item = MlboddsItem()
           item['header'] = site.select('//div[@class="scoreboard-bar"]//h2//span[position()>1]//text()').extract()# | /*//table[position()<2]//tr//th[@colspan="2"]//text()').extract()
           item['game1'] = site.select('/*//table[position()=1]//tr//td[@class="tbl-odds-c2"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c4"]//text() | /*//table[position()=1]//tr//td[@class="tbl-odds-c6"]//text()').extract()
           items.append(item)
       return items

The results are returned in a random order, for example it returns the 29th, then the 28th, then the 30th. I've tried changing the scheduler order from DFO to BFO, just in case that was the problem, but that didn't change anything.

Recommended answer

start_urls defines the URLs that the start_requests method turns into requests. Your parse method is called with a response for each start URL once its page has been downloaded. But you cannot control loading times: the first start URL might be the last one to reach parse.
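
To see why arrival order need not match request order, here is a small simulation. It uses asyncio from the standard library rather than Scrapy itself, and the latencies are made-up numbers, so treat it only as an illustration of the effect.

```python
import asyncio

# Hypothetical per-page latencies in seconds (invented for this demo).
SIMULATED_LATENCY = {
    "20110328": 0.10,
    "20110329": 0.05,
    "20110330": 0.15,
}


async def fetch(day):
    # Stand-in for a download: just wait for the simulated latency.
    await asyncio.sleep(SIMULATED_LATENCY[day])
    return day


async def crawl(days):
    # Issue every request up front, the way concurrent downloads behave.
    tasks = [asyncio.ensure_future(fetch(day)) for day in days]
    arrival_order = []
    # as_completed hands results back in completion order, not issue order.
    for finished in asyncio.as_completed(tasks):
        arrival_order.append(await finished)
    return arrival_order


def run_demo():
    return asyncio.run(crawl(list(SIMULATED_LATENCY)))


if __name__ == "__main__":
    print(run_demo())
```

With these made-up delays the pages arrive as 20110329, 20110328, 20110330, the same kind of shuffle described in the question.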

One solution: override the start_requests method and attach a meta dict with a priority key to each generated request. In parse, extract this priority value and add it to the item. In the pipeline, do something based on this value. (I don't know why or where you need these URLs to be processed in this order.)

Or make it roughly synchronous: store these start URLs somewhere and put only the first of them in start_urls. In parse, process the first response and yield the item(s), then take the next URL from your storage and make a request for it with parse as the callback.
