Accessing Scrapy's request queues, and sleeping and restarting the spider once the queues are drained

1. Source code analysis

Scrapy's default crawling engine is scrapy.core.engine.ExecutionEngine. Its _next_request method pulls the next Request from the scheduler's queue and processes it; when it is done it calls spider_is_idle to check whether the spider's request queues are all empty. Below is a simplified excerpt of the source:


class ExecutionEngine(object):
    ...

    def _next_request(self, spider):
        slot = self.slot
        if not slot:
            return

        if self.paused:
            return

        # keep feeding requests from the scheduler until a backout condition is hit
        while not self._needs_backout(spider):
            if not self._next_request_from_scheduler(spider):
                break

        # pull the next start request (from start_requests) and schedule it
        if slot.start_requests and not self._needs_backout(spider):
            try:
                request = next(slot.start_requests)
            except StopIteration:
                slot.start_requests = None
            except Exception:
                slot.start_requests = None
                logger.error('Error while obtaining start requests',
                             exc_info=True, extra={'spider': spider})
            else:
                self.crawl(request, spider)

        # if nothing is pending anywhere, let the engine close the spider
        if self.spider_is_idle(spider) and slot.close_if_idle:
            self._spider_idle(spider)

    ...

    def spider_is_idle(self, spider):
        if not self.scraper.slot.is_idle():
            # scraper is not idle
            return False

        if self.downloader.active:
            # downloader has pending requests
            return False

        if self.slot.start_requests is not None:
            # not all start requests are handled
            return False

        if self.slot.scheduler.has_pending_requests():
            # scheduler has pending requests
            return False

        return True

    ...

As you can see, the check for whether the spider can exit is made up of four parts:

1. Check whether the scraper (the component that runs spider callbacks and the item pipeline) is idle
2. Check whether the downloader still has active (in-flight) requests
3. Check whether all of the start requests (start_requests) have been consumed
4. Check whether the scheduler still has pending requests

From this we can see that a spider's outstanding requests live mainly in the downloader's active set, the start-requests iterator, and the scheduler's queues.
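For reference, here is a minimal sketch of how those four checks map onto engine attributes reachable from inside a running spider. The attribute paths are the same private Scrapy 1.x internals that spider_is_idle uses, so treat them as an assumption that may change between versions:

engine = self.crawler.engine
engine.scraper.slot.is_idle()                  # 1. scraper (callbacks/pipelines) idle?
bool(engine.downloader.active)                 # 2. downloader still has in-flight requests?
engine.slot.start_requests is None             # 3. start_requests fully consumed?
engine.slot.scheduler.has_pending_requests()   # 4. scheduler still holds requests?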

2. Example: sleeping and restarting the spider on a schedule

Calling time.sleep inside Scrapy puts the entire spider to sleep. I recently needed to implement "after one round of crawling finishes, wait a fixed interval and then start the next round". Since a spider's outstanding requests only live in the queues listed above, all we have to do is re-add the start requests after the queues are empty but before the spider is closed. Here is an example:

import time
import scrapy


class MySpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']
    sleep_time = 10

    def start_requests(self):
        return [scrapy.http.Request(url, dont_filter=True) for url in self.start_urls]

    def check_loop_start(self):
        if self.is_finished():  # sleep and restart once no requests remain in any queue
            self.logger.info('Round finished, starting the next round in %s seconds' % self.sleep_time)
            time.sleep(self.sleep_time)  # blocks the whole spider, but the queues are empty anyway
            return self.start_requests()
        return []

    def is_finished(self):  # check whether any requests remain in the queues
        if self.crawler.engine.downloader.active:
            return False
        if self.crawler.engine.slot.start_requests is not None:
            return False
        if self.crawler.engine.slot.scheduler.has_pending_requests():
            return False
        return True

    def parse(self, response):
        # queue a few follow-up requests; dont_filter=True keeps the dupefilter from dropping them
        for retry in range(5):
            yield scrapy.http.Request('https://example.com', dont_filter=True, callback=self.next_parse)
        for request in self.check_loop_start():
            yield request

    def next_parse(self, response):
        for request in self.check_loop_start():
            yield request
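To try the spider above from a script, a minimal sketch with CrawlerProcess looks like the following (the spider can equally be started with the scrapy crawl command; the settings here are only illustrative assumptions):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'LOG_LEVEL': 'INFO'})  # illustrative settings only
process.crawl(MySpider)
process.start()  # blocks until the spider, including every restarted round, finishes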

3. The spider's scheduler and queues

If you want to get hold of the spider's queues, you can use the following inside the spider above (see scrapy.core.scheduler.Scheduler):

self.crawler.engine.slot.scheduler.dqs  # disk queue
self.crawler.engine.slot.scheduler.mqs  # memory queue

If you want the total number of requests in the spider's queues, you can use (again see scrapy.core.scheduler.Scheduler):

len(self.crawler.engine.slot.scheduler)
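Putting these together, a small sketch that logs the queue sizes from inside the spider might look like this; note the assumption (based on Scrapy 1.x internals) that dqs is None unless the JOBDIR setting enables the disk queue:

scheduler = self.crawler.engine.slot.scheduler
disk_len = len(scheduler.dqs) if scheduler.dqs is not None else 0  # dqs is None without JOBDIR
self.logger.info('memory queue: %d, disk queue: %d, total: %d',
                 len(scheduler.mqs), disk_len, len(scheduler))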

 
