While developing a Scrapy crawler, I wanted scrapy to load an extra configuration file at startup.
That is, launching with the command below should make the crawler pick up the contents of mysettings.py, which might define pipelines, MySQL paths and so on, to make it easy to switch between environments and between individual spiders (single crawl jobs):
scrapy crawl TestMReview -a settingpath=/home/abc/conf/mysettings.py
So I set out to understand how Scrapy initializes when it starts up, and in what order (I could not find this explained anywhere online; possibly my whole approach is wrong).
Below is the logic in startup order.
First of all, the settings are loaded.
In site-packages\scrapy\cmdline.py, settings = get_project_settings() first pulls in the defaults from site-packages\scrapy\settings\default_settings.py and then merges in the project's settings module.
custom_settings is not handled here; it is merged in later, in the __init__ of Crawler (not CrawlerProcess).
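For reference, a minimal sketch of what get_project_settings() does, based on the version I traced (simplified; env-var handling and error checks are omitted, and get_project_settings_sketch is just a name I made up):
import os
from scrapy.settings import Settings

def get_project_settings_sketch():
    settings = Settings()  # a fresh Settings already holds default_settings.py at 'default' priority
    settings_module = os.environ.get('SCRAPY_SETTINGS_MODULE')  # e.g. 'wtest.settings', set up from scrapy.cfg
    if settings_module:
        settings.setmodule(settings_module, priority='project')  # project settings at 'project' priority
    return settings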
Entry point: cmd.crawler_process = CrawlerProcess(settings)
site-packages\scrapy\cmdline.py executes cmd.crawler_process = CrawlerProcess(settings); this performs some initialization.
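In other words, the crawl command ends up doing roughly what this hand-driven sketch does (the spider name and the settingpath kwarg are from my own project; -a key=value pairs arrive as the **kwargs of crawl()):
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
process = CrawlerProcess(settings)  # cmd.crawler_process = CrawlerProcess(settings)
process.crawl('TestMReview', settingpath='/home/abc/conf/mysettings.py')  # -a arguments become spider kwargs
process.start()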
CrawlerRunner initialization:
CrawlerRunner initializes a few things, such as scrapy.spiderloader.SpiderLoader
-> SpiderLoader's __init__ calls self._load_all_spiders()
-> which loads the members of class TestMReview(RedisSpider), i.e. the key class attributes such as custom_settings, name, allowed_domains
-> what gets loaded is the class (type) itself, not an instance
-> these spider classes (types) are then stored inside the SpiderLoader
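You can poke at the SpiderLoader by hand to see what it stores (a small usage sketch; 'TestMReview' is my project's spider name):
from scrapy.utils.project import get_project_settings
from scrapy.spiderloader import SpiderLoader

settings = get_project_settings()
loader = SpiderLoader.from_settings(settings)  # internally runs _load_all_spiders()
print(loader.list())                           # names of all spiders found under SPIDER_MODULES
spidercls = loader.load('TestMReview')         # returns the spider class (type), not an instance
print(spidercls.custom_settings)               # the class attribute is already visible at this point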
There are a few small utility classes in there; brief notes:
from scrapy.utils.misc import load_object: loader_cls = load_object(cls_path) loads a class from its dotted path, e.g. load_object('scrapy.spiderloader.SpiderLoader')
class ISpiderLoader(Interface): for an interface like this, an implementation is marked with @implementer(ISpiderLoader)
copy.deepcopy(self) copies an object (import copy)
scrapy.utils.misc.walk_modules(path) loads every module under a package path into a list
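A quick sketch of those two helpers (the package name 'wtest.spiders' is just my project's spider package, substitute your own):
from scrapy.utils.misc import load_object, walk_modules

SpiderLoaderCls = load_object('scrapy.spiderloader.SpiderLoader')  # class loaded from its dotted path
print(SpiderLoaderCls)

for mod in walk_modules('wtest.spiders'):  # imports and returns every module under the package
    print(mod.__name__)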
Entry point: site-packages\scrapy\cmdline.py executes _run_print_help(parser, _run_command, cmd, args, opts)
Some logic in Crawler's __init__:
Load the spider class (type): spidercls = self.spider_loader.load(spidercls)
Merge custom_settings into settings: self.spidercls.update_settings(self.settings)
Initialize some middleware/extension objects,
and at the end of __init__, self.settings.freeze() freezes the settings.
The loading of custom_settings here refers to the custom_settings = {} dict defined on the spider class;
all of this happens in site-packages\scrapy\crawler.py.
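Putting those steps together, Crawler.__init__ looks roughly like this in the version I traced (a simplified sketch, not the literal source; CrawlerSketch is just an illustrative name):
class CrawlerSketch:
    def __init__(self, spidercls, settings):
        self.spidercls = spidercls
        self.settings = settings.copy()
        # merge the spider's custom_settings dict into settings at 'spider' priority:
        self.spidercls.update_settings(self.settings)
        # (stats collector, log formatter and extensions are created here in the real code)
        self.settings.freeze()  # after this, settings.set(...) raises TypeError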
The above is the initialization of the Crawler; here are the other actions:
Initialize the spider; the order during initialization is from_crawler -> __init__ -> _set_crawler
Initialize the engine: self.engine = self._create_engine() (this is where the other middlewares get initialized)
What does self.engine.open_spider(self.spider, start_requests) do?
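These steps live in Crawler.crawl(); roughly (a simplified sketch of the version I traced, using Twisted's inlineCallbacks; the real method has extra error handling):
from twisted.internet import defer

class CrawlerCrawlSketch:
    @defer.inlineCallbacks
    def crawl(self, *args, **kwargs):
        # from_crawler -> __init__ -> _set_crawler; *args/**kwargs are the -a arguments
        self.spider = self.spidercls.from_crawler(self, *args, **kwargs)
        self.engine = self._create_engine()  # ExecutionEngine: downloader, scraper and their middlewares
        start_requests = iter(self.spider.start_requests())
        yield self.engine.open_spider(self.spider, start_requests)
        yield defer.maybeDeferred(self.engine.start)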
1. Entry point:
ExecutionEngine ->
class Downloader(object) in site-packages\scrapy\core\downloader\__init__.py,
i.e. when the engine is created above, the Downloader's __init__ runs
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
2. Building the middleware objects
DownloaderMiddlewareManager (site-packages\scrapy\middleware.py) does this in
from_settings: mwlist = cls._get_mwlist_from_settings(settings) finds the list of downloader middlewares configured in settings,
then each middleware's from_crawler is called to get an instance (those methods all call cls() somewhere).
3. The resulting list of middleware objects is handed to the DownloaderMiddlewareManager.
1. Entry point:
ExecutionEngine ->
site-packages\scrapy\core\scraper.py
Follow-up:
SpiderMiddlewareManager and DownloaderMiddlewareManager both inherit from MiddlewareManager, so their initialization is almost the same.
Including the case above: when they initialize, the core code that reads the configuration and builds the objects is in
MiddlewareManager.from_settings (mwlist = cls._get_mwlist_from_settings(settings)).
Although they are all "middlewares", the positions at which they execute are different! (See the sketch below.)
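A simplified sketch of that shared MiddlewareManager.from_settings logic (NotConfigured handling and logging are omitted; MiddlewareManagerSketch is just an illustrative name):
from scrapy.utils.misc import load_object

class MiddlewareManagerSketch:
    @classmethod
    def from_settings(cls, settings, crawler=None):
        mwlist = cls._get_mwlist_from_settings(settings)  # each subclass reads its own setting key
        middlewares = []
        for clspath in mwlist:
            mwcls = load_object(clspath)
            if crawler and hasattr(mwcls, 'from_crawler'):
                mw = mwcls.from_crawler(crawler)
            elif hasattr(mwcls, 'from_settings'):
                mw = mwcls.from_settings(settings)
            else:
                mw = mwcls()
            middlewares.append(mw)
        return cls(*middlewares)  # the manager keeps the ordered middleware instances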
Setting a breakpoint and inspecting cls and mwlist, we can see quite a few subclasses of MiddlewareManager being initialized:
mwlist for ExtensionManager (from the EXTENSIONS setting):
0 = {str} 'scrapy.extensions.corestats.CoreStats'
1 = {str} 'scrapy.extensions.telnet.TelnetConsole'
2 = {str} 'scrapy.extensions.memusage.MemoryUsage'
3 = {str} 'scrapy.extensions.memdebug.MemoryDebugger'
4 = {str} 'scrapy.extensions.closespider.CloseSpider'
5 = {str} 'scrapy.extensions.feedexport.FeedExporter'
6 = {str} 'scrapy.extensions.logstats.LogStats'
7 = {str} 'scrapy.extensions.spiderstate.SpiderState'
8 = {str} 'scrapy.extensions.throttle.AutoThrottle'
__len__ = {int} 9
mwlist for DownloaderMiddlewareManager (from the DOWNLOADER_MIDDLEWARES setting):
00 = {str} 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware'
01 = {str} 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware'
02 = {str} 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware'
03 = {str} 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware'
04 = {str} 'wtest.middlewares.RandomUserAgentMiddlware'
05 = {str} 'wtest.middlewares.RandomProxyMiddleware'
06 = {str} 'wtest.middlewares.signalDownloaderMiddleware'
07 = {str} 'scrapy.downloadermiddlewares.retry.RetryMiddleware'
08 = {str} 'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware'
09 = {str} 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware'
10 = {str} 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware'
11 = {str} 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware'
12 = {str} 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware'
13 = {str} 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware'
14 = {str} 'scrapy.downloadermiddlewares.stats.DownloaderStats'
15 = {str} 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware'
__len__ = {int} 16
mwlist for SpiderMiddlewareManager (from the SPIDER_MIDDLEWARES setting):
0 = {str} 'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware'
1 = {str} 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware'
2 = {str} 'scrapy.spidermiddlewares.referer.RefererMiddleware'
3 = {str} 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware'
4 = {str} 'scrapy.spidermiddlewares.depth.DepthMiddleware'
__len__ = {int} 5
mwlist for ItemPipelineManager (from the ITEM_PIPELINES setting):
0 = {str} 'wtest.pipelines.KafkaReviewPipline'
__len__ = {int} 1
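Each of those lists is the _BASE default setting merged with the project's own setting, sorted by the numeric order values; you can reproduce them with the same helpers the managers use (getwithbase and build_component_list exist in the version I traced):
from scrapy.utils.project import get_project_settings
from scrapy.utils.conf import build_component_list

settings = get_project_settings()
print(build_component_list(settings.getwithbase('DOWNLOADER_MIDDLEWARES')))
print(build_component_list(settings.getwithbase('SPIDER_MIDDLEWARES')))
print(build_component_list(settings.getwithbase('ITEM_PIPELINES')))
print(build_component_list(settings.getwithbase('EXTENSIONS')))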
* Settings values have priorities.
In the code this shows up as calls like settings.setdict(cmd.default_settings, priority='command')
site-packages\scrapy\settings\__init__.py defines:
SETTINGS_PRIORITIES = {
'default': 0,
'command': 10,
'project': 20,
'spider': 30,
'cmdline': 40,
}
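A small demo of how these priorities interact and what freeze() does (DOWNLOAD_DELAY is just an example key):
from scrapy.settings import Settings

s = Settings()                                   # defaults are already loaded at 'default' priority
s.set('DOWNLOAD_DELAY', 1, priority='project')
s.set('DOWNLOAD_DELAY', 9, priority='default')   # lower priority than the existing 'project' value: ignored
print(s.getint('DOWNLOAD_DELAY'))                # 1
s.set('DOWNLOAD_DELAY', 3, priority='spider')    # higher priority wins; this is what custom_settings uses
print(s.getint('DOWNLOAD_DELAY'))                # 3
s.freeze()
try:
    s.set('DOWNLOAD_DELAY', 5, priority='cmdline')
except TypeError:
    print('settings are immutable after freeze()')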
https://www.processon.com/view/link/5c283bb0e4b089c3cb5ef480
Coming back to the background at the top of this post, I still have not found an answer.
As the code below shows, when the spider class is loaded (not instantiated), the command-line arguments have not been passed in yet, and there is no hook at that point either.
That is, in the following code X1~X5 run in order; the earliest place where we can get at the arguments passed from the command line is X4, and by that time the settings have already been frozen. So I have not found a good way to do this...
class abcSpider(RedisSpider):
    name = 'abcSpider'
    allowed_domains = ['abcdefg.com', 'abcdefg-a.sss.net']
    custom_settings = PythonToMap.genMap(settings_abcSpider)  # X1: class attribute, evaluated when the class is loaded
    # X2: scrapy merges custom_settings into the settings (Crawler.__init__ -> update_settings)
    # X3: scrapy freezes the settings (Crawler.__init__ -> settings.freeze())

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        cls.custom_settings["aaaa"] = "sd"  # X4: the -a arguments are available here (in kwargs), but settings are already frozen
        return super().from_crawler(crawler, *args, **kwargs)

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        print("abcSpider")  # X5

    def _set_crawler(self, crawler):
        super()._set_crawler(crawler)
In real use we deploy with scrapyd; I have not tested whether the parameters passed through scrapyd end up as the same cmd arguments described above (but I would guess they do).