[Series] Scrapy startup flow source code analysis (6): the Downloader

6. The Downloader

The Downloader covers every step from the moment a URL is taken off the scheduler to the moment the scraper receives the downloaded page content. It determines how pages are downloaded and touches on network communication, the HTTP protocol, servers and related topics, making it the most complex part of Scrapy.

Downloader initialization

The Downloader object was already initialized earlier, when Crawler.crawl() created the ExecutionEngine. Its class is defined in the settings and defaults to:

scrapy/settings/default_settings.py:

DOWNLOADER = 'scrapy.core.downloader.Downloader'

scrapy/core/downloader/__init__.py:

class Downloader(object):

    def __init__(self, crawler):
        self.settings = crawler.settings
        self.signals = crawler.signals
        self.slots = {}          # per-domain (or per-IP) Slot objects
        self.active = set()      # all requests currently being downloaded
        self.handlers = DownloadHandlers(crawler)
        self.total_concurrency = self.settings.getint('CONCURRENT_REQUESTS')
        self.domain_concurrency = self.settings.getint('CONCURRENT_REQUESTS_PER_DOMAIN')
        self.ip_concurrency = self.settings.getint('CONCURRENT_REQUESTS_PER_IP')
        self.randomize_delay = self.settings.getbool('RANDOMIZE_DOWNLOAD_DELAY')
        self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
        self._slot_gc_loop = task.LoopingCall(self._slot_gc)  # purge idle slots
        self._slot_gc_loop.start(60)                          # every 60 seconds

There are four key objects here: slots, active, DownloadHandlers and DownloaderMiddlewareManager, plus a handful of configuration options.
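
For reference, the defaults those getint/getbool calls read are (from the same default_settings.py):

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 0      # 0 disables the per-IP limit
RANDOMIZE_DOWNLOAD_DELAY = True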

Downloader: the slots object

slots is a dict of Slot objects: the key is the domain a request targets, the value is a Slot object.
A Slot governs one class of download requests, normally all requests to a single domain.
The Slot also controls the concurrency, download delay and delay randomization for that domain. The point is to shape the access policy per domain, which to some extent avoids getting the IP banned for excessive traffic.

scrapy/core/downloader/__init__.py#Downloader:

    def _get_slot(self, request, spider):
        key = self._get_slot_key(request, spider)
        if key not in self.slots:
            conc = self.ip_concurrency if self.ip_concurrency else self.domain_concurrency
            conc, delay = _get_concurrency_delay(conc, spider, self.settings)
            self.slots[key] = Slot(conc, delay, self.randomize_delay)

        return key, self.slots[key]
        
    def _get_slot_key(self, request, spider):
        if 'download_slot' in request.meta:
            return request.meta['download_slot']

        key = urlparse_cached(request).hostname or ''
        if self.ip_concurrency:
            key = dnscache.get(key, key)

        return key

For a request, '_get_slot_key' is called first to compute the request's key. Inside '_get_slot_key', the key can be overridden by adding a 'download_slot' entry to the request's meta, which adds flexibility. Without such an override, the key is the hostname the request targets.
Hostname resolution is also cached, via urlparse_cached and dnscache.
The slots dict itself acts as a cache too: the access policy for a domain is looked up in slots instead of being re-derived from the settings every time.
Finally the Slot for the key is fetched from slots; if none exists yet, a new one is constructed.
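
As a quick illustration, a spider can pin requests to a custom slot through that meta key. A hypothetical snippet (spider and slot names are made up):

import scrapy

class PinnedSpider(scrapy.Spider):
    name = 'pinned'

    def start_requests(self):
        # Both hosts end up in one Slot (one concurrency/delay policy),
        # because the meta key overrides the per-hostname default.
        for url in ['http://a.example.com/', 'http://b.example.com/']:
            yield scrapy.Request(url, meta={'download_slot': 'shared-pool'})

    def parse(self, response):
        self.logger.info('downloaded %s', response.url)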

The Slot constructor takes three arguments: the concurrency limit, the delay, and whether to randomize the delay.
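
For reference, the Slot class itself looks roughly like this (a sketch reconstructed from how its fields are used later in this article; see the Scrapy source for the exact code):

from collections import deque
import random

class Slot(object):

    def __init__(self, concurrency, delay, randomize_delay):
        self.concurrency = concurrency
        self.delay = delay
        self.randomize_delay = randomize_delay
        self.active = set()        # requests owned by this slot
        self.queue = deque()       # (request, deferred) pairs awaiting download
        self.transferring = set()  # requests currently on the wire
        self.lastseen = 0          # time the last download started
        self.latercall = None      # pending reactor.callLater handle

    def free_transfer_slots(self):
        return self.concurrency - len(self.transferring)

    def download_delay(self):
        if self.randomize_delay:
            return random.uniform(0.5 * self.delay, 1.5 * self.delay)
        return self.delay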

Downloader: the active object

active is a set recording the requests currently being downloaded.

Downloader: the DownloadHandlers object

handlers is a DownloadHandlers object; it manages a set of handlers, one per download protocol.

The default handlers are as follows:
scrapy/settings/default_settings.py:

DOWNLOAD_HANDLERS_BASE = {
    'data': 'scrapy.core.downloader.handlers.datauri.DataURIDownloadHandler',
    'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler',
    'http': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    'https': 'scrapy.core.downloader.handlers.http.HTTPDownloadHandler',
    's3': 'scrapy.core.downloader.handlers.s3.S3DownloadHandler',
    'ftp': 'scrapy.core.downloader.handlers.ftp.FTPDownloadHandler',
}

Later, when a page is actually downloaded, the selected handler's download_request method is called.

Downloader: the DownloaderMiddlewareManager object

The downloader middleware implementation works the same way as the scraper's spider middleware: it is managed through a Manager class.

The default middlewares:
scrapy/settings/default_settings.py:

DOWNLOADER_MIDDLEWARES_BASE = {
    # Engine side
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
    # Downloader side
}

scrapy/core/downloader/middleware.py:

class DownloaderMiddlewareManager(MiddlewareManager):

    component_name = 'downloader middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        return build_component_list(
            settings.getwithbase('DOWNLOADER_MIDDLEWARES'))

    def _add_middleware(self, mw):
        if hasattr(mw, 'process_request'):
            self.methods['process_request'].append(mw.process_request)
        if hasattr(mw, 'process_response'):
            self.methods['process_response'].insert(0, mw.process_response)
        if hasattr(mw, 'process_exception'):
            self.methods['process_exception'].insert(0, mw.process_exception)

So the hook methods a downloader middleware can implement are: open_spider, close_spider, process_request, process_response and process_exception. Note how they are registered: process_request hooks are appended, so they run in engine-to-downloader order, while process_response and process_exception hooks are inserted at the front, so they run in the reverse, downloader-to-engine order.

Downloader.fetch(): downloading pages

The fetch call chain starts in the engine's _next_request_from_scheduler.

scrapy/core/engine.py#ExecutionEngine:

    def _next_request_from_scheduler(self, spider):
        slot = self.slot
        request = slot.scheduler.next_request()
        if not request:
            return
        d = self._download(request, spider)

    def _download(self, request, spider):
        slot = self.slot
        slot.add_request(request)
        def _on_success(response):
            assert isinstance(response, (Response, Request))
            if isinstance(response, Response):
                response.request = request # tie request to response received
                logkws = self.logformatter.crawled(request, response, spider)
                logger.log(*logformatter_adapter(logkws), extra={'spider': spider})
                self.signals.send_catch_log(signal=signals.response_received, \
                    response=response, request=request, spider=spider)
            return response

        def _on_complete(_):
            slot.nextcall.schedule()
            return _

        dwld = self.downloader.fetch(request, spider)
        dwld.addCallbacks(_on_success)
        dwld.addBoth(_on_complete)
        return dwld

_download first adds the request to the engine slot's inprogress set to record it as in flight, then calls the downloader's fetch method. It attaches an '_on_success' callback to the deferred returned by fetch, so that once the download completes a log line is written and a response_received signal is broadcast to whoever is interested.
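
As a side note, any component can subscribe to that signal. A minimal extension sketch (the class name is hypothetical):

from scrapy import signals

class ResponseLogger(object):

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.on_response,
                                signal=signals.response_received)
        return ext

    def on_response(self, response, request, spider):
        # Fires every time the engine's _on_success broadcasts a response.
        spider.logger.info('received %s (%d)', response.url, response.status)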

scrapy/core/downloader/__init__.py#Downloader:

    def fetch(self, request, spider):
        def _deactivate(response):
            self.active.remove(request)
            return response

        self.active.add(request)
        dfd = self.middleware.download(self._enqueue_request, request, spider)
        return dfd.addBoth(_deactivate)

fetch records the request in the downloader-wide active set, then calls the middleware manager's download method, passing in its own _enqueue_request method as the download function.

scrapy/core/downloader/middleware.py#DownloaderMiddlewareManager:

    def download(self, download_func, request, spider):
        @defer.inlineCallbacks
        def process_request(request):
            for method in self.methods['process_request']:
                response = yield method(request=request, spider=spider)
                assert response is None or isinstance(response, (Response, Request)), \
                        'Middleware %s.process_request must return None, Response or Request, got %s' % \
                        (six.get_method_self(method).__class__.__name__, response.__class__.__name__)
                if response:
                    defer.returnValue(response)
            defer.returnValue((yield download_func(request=request,spider=spider)))

        @defer.inlineCallbacks
        def process_response(response):
            assert response is not None, 'Received None in process_response'
            if isinstance(response, Request):
                defer.returnValue(response)

            for method in self.methods['process_response']:
                response = yield method(request=request, response=response,
                                        spider=spider)
                assert isinstance(response, (Response, Request)), \
                    'Middleware %s.process_response must return Response or Request, got %s' % \
                    (six.get_method_self(method).__class__.__name__, type(response))
                if isinstance(response, Request):
                    defer.returnValue(response)
            defer.returnValue(response)

        @defer.inlineCallbacks
        def process_exception(_failure):
            exception = _failure.value
            for method in self.methods['process_exception']:
                response = yield method(request=request, exception=exception,
                                        spider=spider)
                assert response is None or isinstance(response, (Response, Request)), \
                    'Middleware %s.process_exception must return None, Response or Request, got %s' % \
                    (six.get_method_self(method).__class__.__name__, type(response))
                if response:
                    defer.returnValue(response)
            defer.returnValue(_failure)

        deferred = mustbe_deferred(process_request, request)
        deferred.addErrback(process_exception)
        deferred.addCallback(process_response)
        return deferred

Like SpiderMiddlewareManager's scrape_response method, this first runs every middleware's 'process_request' on the request in turn, then calls the Downloader's '_enqueue_request' method to perform the download, and finally runs every middleware's 'process_response' on the response.
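
A minimal downloader middleware sketch (hypothetical names) makes the contract enforced by those asserts concrete: process_request may return None (fall through), a Response (skip the download), or a Request (reschedule); process_response must return a Response or a Request:

class StampMiddleware(object):
    """Hypothetical example middleware."""

    def process_request(self, request, spider):
        request.headers.setdefault('X-Crawled-By', spider.name)
        return None  # continue down the chain to _enqueue_request

    def process_response(self, request, response, spider):
        spider.logger.debug('%d for %s', response.status, request.url)
        return response  # pass the response back up the chain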

scrapy/core/downloader/__init__.py#Downloader:

    def _enqueue_request(self, request, spider):
        key, slot = self._get_slot(request, spider)
        request.meta['download_slot'] = key

        def _deactivate(response):
            slot.active.remove(request)
            return response

        slot.active.add(request)
        deferred = defer.Deferred().addBoth(_deactivate)
        slot.queue.append((request, deferred))
        self._process_queue(spider, slot)
        return deferred

_enqueue_request calls the '_get_slot' method analyzed earlier to get the Slot matching the request (essentially by domain), adds the request to that slot's own active set, and appends the request together with its deferred to the slot's queue. It then calls '_process_queue' to drive the slot.
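
The pattern is worth noting: _enqueue_request returns a placeholder Deferred immediately, and the real download result is piped into it later via chainDeferred (seen in _process_queue below). A tiny standalone twisted sketch of that pattern:

from twisted.internet import defer

placeholder = defer.Deferred()            # handed to the caller right away
placeholder.addCallback(lambda body: len(body))

actual = defer.Deferred()                 # will fire when the work is done
actual.chainDeferred(placeholder)         # forward actual's result

actual.callback('response body')          # placeholder's callbacks now run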

scrapy/core/downloader/__init__.py#Downloader:

    def _process_queue(self, spider, slot):
        if slot.latercall and slot.latercall.active():
            return

        # Delay queue processing if a download_delay is configured
        now = time()
        delay = slot.download_delay()
        if delay:
            penalty = delay - now + slot.lastseen
            if penalty > 0:
                slot.latercall = reactor.callLater(penalty, self._process_queue, spider, slot)
                return

        # Process enqueued requests if there are free slots to transfer for this slot
        while slot.queue and slot.free_transfer_slots() > 0:
            slot.lastseen = now
            request, deferred = slot.queue.popleft()
            dfd = self._download(slot, request, spider)
            dfd.chainDeferred(deferred)
            # prevent burst if inter-request delays were configured
            if delay:
                self._process_queue(spider, slot)
                break

This method's job is to take requests off the slot's queue and download them.
Key points:
1. if a latercall is already pending, return immediately;
2. read the slot's configured delay;
3. if not enough time has passed since the last download, schedule a latercall and return;
4. keep draining the slot's queue: while the queue is non-empty and there are free transfer slots, start a download; if a delay is configured, call '_process_queue' again and stop the loop.

The while loop at the bottom processes queued requests, starting a download only when a free transfer slot is available.
Each time a download is started, slot.lastseen is updated to the current time; it records the slot's most recent download activity.

If a delay is needed, '_process_queue' is called once more and the loop breaks; otherwise the loop keeps starting downloads back to back.
On that re-entry, the method checks whether the delay measured from the last activity has elapsed; if more waiting is needed, it schedules a latercall (via twisted's reactor.callLater). That latercall will process the slot's queue again, which is why the method bails out at the top when an active latercall already exists.
In this way requests keep being processed, with the proper delay inserted where required.
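
A worked example of the penalty computation (illustrative numbers): with delay = 2.0s, lastseen = 100.0 and now = 101.2, only 1.2s have elapsed, so:

delay, lastseen, now = 2.0, 100.0, 101.2
penalty = delay - now + lastseen        # == delay - (now - lastseen)
assert abs(penalty - 0.8) < 1e-9        # 0.8s left; reactor.callLater(0.8, ...)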

scrapy/core/downloader/__init__.py#Downloader:

    def _download(self, slot, request, spider):
        # The order is very important for the following deferreds. Do not change!

        # 1. Create the download deferred
        dfd = mustbe_deferred(self.handlers.download_request, request, spider)

        # 2. Notify response_downloaded listeners about the recent download
        # before querying queue for next request
        def _downloaded(response):
            self.signals.send_catch_log(signal=signals.response_downloaded,
                                        response=response,
                                        request=request,
                                        spider=spider)
            return response
        dfd.addCallback(_downloaded)

        # 3. After response arrives,  remove the request from transferring
        # state to free up the transferring slot so it can be used by the
        # following requests (perhaps those which came from the downloader
        # middleware itself)
        slot.transferring.add(request)

        def finish_transferring(_):
            slot.transferring.remove(request)
            self._process_queue(spider, slot)
            return _

        return dfd.addBoth(finish_transferring)

This calls DownloadHandlers' download_request method and adds the request to the transferring set of in-flight transfers.
A finish_transferring callback is attached to the returned Deferred: every time a request finishes downloading, it is removed from the transferring set and another '_process_queue' pass is triggered, which guarantees requests never sit forgotten in the queue.

scrapy/core/downloader/handlers/__init__.py#DownloadHandlers:

    def download_request(self, request, spider):
        scheme = urlparse_cached(request).scheme
        handler = self._get_handler(scheme)
        if not handler:
            raise NotSupported("Unsupported URL scheme '%s': %s" %
                               (scheme, self._notconfigured[scheme]))
        return handler.download_request(request, spider)

Here the handler matching the URL's scheme is looked up; these handlers were covered above: each protocol has its own handler, and the actual download is performed by calling that handler's download_request.
The download_request of each handler calls into different twisted APIs, using twisted's built-in networking to download the page.
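
Since the DOWNLOAD_HANDLERS setting is merged over DOWNLOAD_HANDLERS_BASE (via settings.getwithbase), a project can route a new scheme to its own handler. A hypothetical settings.py snippet:

DOWNLOAD_HANDLERS = {
    # hypothetical class; download_request would be called for foo:// URLs
    'foo': 'myproject.handlers.FooDownloadHandler',
}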

Afterword

At this point, scrapy's five core components have all basically been covered: engine, scheduler, scraper, spidermw, downloader.

Since I haven't done much hands-on work yet, I'll stop the deep dive here; further material will be organized after building a complete project.

Some of this material draws on this blogger's related articles: https://blog.csdn.net/happyanger6/article/category/6085726
