Calling Scrapy spiders from Python with proper argument passing and global access

There are plenty of ways to invoke Scrapy from Python, for example:

Shelling out to the scrapy command line:

import os
os.system("scrapy crawl SpiderName")
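Because os.system starts a separate process, the only way to hand parameters to the spider here is the command line itself; scrapy's -a option forwards key=value pairs to the spider's constructor (SpiderName and domain are placeholder names):

import os
# -a key=value is passed to the spider's __init__ as a keyword argument
os.system("scrapy crawl SpiderName -a domain=123.com")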
Running a spider in-process with CrawlerProcess:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
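Note that when CrawlerProcess is given an ad-hoc settings dict like this, the project's own settings.py (pipelines, middlewares, and so on) is not loaded; the recommended approach further below passes get_project_settings() to pick it up.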
Using CrawlerRunner together with the Twisted reactor:

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
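What CrawlerRunner buys you over CrawlerProcess is composability: several crawls can be chained inside one reactor, each starting only after the previous one finishes. A minimal sketch following the pattern from the Scrapy documentation (the two placeholder spiders are invented for illustration):

from twisted.internet import reactor, defer
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class SpiderOne(scrapy.Spider):
    # hypothetical placeholder spider
    name = 'one'
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass

class SpiderTwo(scrapy.Spider):
    # hypothetical placeholder spider
    name = 'two'
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits until the previous crawl has finished
    yield runner.crawl(SpiderOne)
    yield runner.crawl(SpiderTwo)
    reactor.stop()

crawl()
reactor.run()  # blocks until both crawls have finished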

But only two approaches can actually deliver what the title promises; here is a summary of both:

The first one is simple, but cmdline.execute itself ends with sys.exit(cmd.exitcode), so the whole program exits as soon as the crawl finishes and nothing after the call ever runs. It is therefore only suitable for debugging.

from scrapy import cmdline
cmdline.execute("scrapy crawl SpiderName".split())

The second one takes a few more lines of code, but it has none of the first one's drawbacks, so it is the recommended approach.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'SpiderName' is the name of one of the spiders in the project
process.crawl('SpiderName', domain='123.com')
process.start() # the script will block here until the crawling is finished
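For the argument passing in the title to work end to end, the spider has to accept the keyword argument on its side. A minimal sketch of what SpiderName might look like, matching the domain parameter in the process.crawl call above:

import scrapy

class MySpider(scrapy.Spider):
    name = 'SpiderName'

    def __init__(self, domain=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # domain arrives here from process.crawl('SpiderName', domain='123.com')
        self.start_urls = [f'http://{domain}/']

    def parse(self, response):
        pass

In fact, Spider's default __init__ already copies unknown keyword arguments onto the instance as attributes, so self.domain would be available even without overriding __init__.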

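Data can also flow the other way, from the spider back into the calling script (the "global" access mentioned in the title), by listening to Scrapy's item_scraped signal. A sketch assuming the project defines a spider named SpiderName:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

collected = []  # lives in the calling script, still available after the crawl

def on_item_scraped(item, response, spider):
    collected.append(item)

process = CrawlerProcess(get_project_settings())
crawler = process.create_crawler('SpiderName')  # placeholder spider name
crawler.signals.connect(on_item_scraped, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl is finished

print(len(collected))  # all scraped items are accessible here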