This article does not go into the configuration of scrapy-redis or scrapy-splash in much detail; its focus is on how to integrate scrapy-splash into a scrapy-redis project.
GitHub repository: https://github.com/rmax/scrapy-redis
scrapy-redis can be enabled either through the project-wide settings or by overriding the relevant settings per spider with custom_settings.
See https://github.com/rmax/scrapy-redis#usage: adding the redis-related settings is all that is needed.
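For reference, a minimal project-wide setup (a sketch based on the scrapy-redis README; the redis address is just an example) looks like this in settings.py:

# settings.py: minimal scrapy-redis configuration (values are examples)
# Use the redis-backed scheduler and duplicate filter
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Push scraped items into redis for later processing
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Redis connection string
REDIS_URL = 'redis://192.168.5.174:6379/3'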
When a project contains multiple spiders, however, scrapy-redis cannot simply be enabled through the global settings and has to be configured per spider. The simplest way is to put the scrapy-redis settings into each spider's custom_settings.
custom_settings = {
    "REDIS_URL": "redis://192.168.5.174:6379/3",
    "SCHEDULER": "scrapy_redis.scheduler.Scheduler",
    "DUPEFILTER_CLASS": "scrapy_redis.dupefilter.RFPDupeFilter",
    # Set to True when the start data in redis is stored as a set (default: False)
    "REDIS_START_URLS_AS_SET": True,
    "ITEM_PIPELINES": {
        'scrapy_redis.pipelines.RedisPipeline': 300
    }
}
Adding this block to every spider quickly becomes tedious, though, so the relevant settings are instead wrapped up once by subclassing RedisSpider.
'''
Wraps the basic scrapy-redis configuration so the distributed crawler can be used without
touching settings.py.
Override order: custom_settings > redis_settings > settings.py
The DOWNLOADER_MIDDLEWARES from settings.py therefore still applies; override it in
custom_settings if it is not wanted.
'''
from scrapy_redis.spiders import RedisSpider


class MyRedisSpider(RedisSpider):
    redis_url = None

    def __init__(self, *args, **kwargs):
        super(MyRedisSpider, self).__init__(*args, **kwargs)

    @classmethod
    def update_settings(cls, settings):
        redis_settings = {
            "REDIS_URL": None,
            "SCHEDULER": "scrapy_redis.scheduler.Scheduler",
            "DUPEFILTER_CLASS": "scrapy_redis.dupefilter.RFPDupeFilter",
            # Set to True when the start data in redis is stored as a set (default: False)
            "REDIS_START_URLS_AS_SET": True,
            "ITEM_PIPELINES": {
                'scrapy_redis.pipelines.RedisPipeline': 300
            }
        }
        # Settings defined by the subclass take precedence over redis_settings.
        # REDIS_URL must be supplied either in custom_settings or via the redis_url class attribute.
        if cls.custom_settings is not None:
            cls.custom_settings = dict(redis_settings, **cls.custom_settings)
        else:
            cls.custom_settings = redis_settings
        if cls.redis_url is not None:
            cls.custom_settings["REDIS_URL"] = cls.redis_url
        settings.setdict(cls.custom_settings or {}, priority='spider')
By default scrapy-redis fetches 20 URL entries from redis at a time and issues each of them with yield Request. In this project, however, redis does not store ready-made URLs; the URL is assembled in code, so the make_request_from_data method has to be overridden.
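For reference, the stock implementation in scrapy_redis.spiders.RedisMixin does little more than treat each redis entry as a finished URL; a simplified sketch of the version used here:

# Simplified sketch of the default scrapy-redis behaviour
def make_request_from_data(self, data):
    url = bytes_to_str(data, self.redis_encoding)  # data is the raw bytes read from redis
    return self.make_requests_from_url(url)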
'''
In the actual business case, redis stores company names as a set, and the page has to be
requested through a browser driven by selenium.
'''
from scrapy_redis.utils import bytes_to_str
import urlparse  # Python 2; on Python 3 use "from urllib import parse as urlparse"


class MySpider(MyRedisSpider):
    name = "my_spider"
    allowed_domains = ["localhost"]
    redis_key = 'companies'
    redis_url = 'redis://localhost:6379/3'
    # Address used for the first scrapy request; any reachable URL will do.
    # The company name is appended to it as a query parameter.
    url = "http://localhost"
    # custom_settings = {
    #     "REDIS_URL": "redis://localhost:6379/3"
    # }

    '''
    Overrides the RedisSpider method; data is one entry read from redis.
    Note: because the page is fetched through a browser, the company parameter can only be
    passed via the URL, and any reachable URL is good enough. For ordinary requests the URL
    is assembled in exactly the same way.
    '''
    def make_request_from_data(self, data):
        '''
        :params data bytes, Message from redis
        '''
        # Any reachable address works here
        company = bytes_to_str(data, self.redis_encoding)
        url = self.url + "?company=" + company
        return self.make_requests_from_url(url)

    def parse(self, response):
        # Extract the company name from the URL; the response body itself is not needed
        rs = urlparse.urlparse(response.url)
        params = urlparse.parse_qs(rs.query, True)
        company = params['company'][0].decode(self.redis_encoding)
        self.logger.debug(company)
        # Browser automation and scraping code omitted
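Because REDIS_START_URLS_AS_SET is enabled, the spider reads its start data from a redis set, so the company names have to be seeded with SADD rather than LPUSH. A hypothetical seeding script using the redis-py client (the connection parameters mirror redis_url above):

# Hypothetical seeding script (pip install redis)
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=3)
# REDIS_START_URLS_AS_SET = True means the spider pops members from a set, so use SADD
r.sadd('companies', 'Example Company')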
For the splash side, see the separate article 利用scrapy-splash爬取JS生成的动态页面 (crawling JS-rendered dynamic pages with scrapy-splash).
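As noted at the top, the scrapy-splash setup itself is not covered here; its usual settings (taken from the scrapy-splash README, with an example Splash address) can stay in settings.py thanks to the override order described earlier:

# settings.py: standard scrapy-splash configuration (address is an example)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# The README also suggests DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter',
# but that clashes with scrapy-redis; see the dupefilter discussion below.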
from scrapy_splash import SplashRequest
from pyquery import PyQuery as pq


class MySplashSpider(MyRedisSpider):
    name = "my_splash_spider"
    allowed_domains = ["localhost"]
    url = "http://localhost"
    redis_url = 'redis://localhost:6379/3'
    redis_key = 'companies'

    '''
    Redis holds company names stored as a set; the pages are requested with SplashRequest.
    Note: a SplashRequest (or any other third-party Request subclass) cannot be returned
    directly from make_request_from_data; the requests are simply never executed and no
    exception is raised. Overriding both make_request_from_data and make_requests_from_url,
    however, does work.
    '''
    def make_request_from_data(self, data):
        '''
        :params data bytes, Message from redis
        '''
        company = bytes_to_str(data, self.redis_encoding)
        url = self.url + '/company/basic.jspx?company=' + company
        return self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return SplashRequest(url, callback=self.parse, args={'wait': 3, 'html': 1})

    def parse(self, response):
        soup = pq(response.body_as_unicode())
        # Remaining parsing code omitted
The scrapy-redis configuration sets "DUPEFILTER_CLASS": "scrapy_redis.dupefilter.RFPDupeFilter", which overrides the DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' that scrapy-splash requires.
Reading the source of scrapy_splash.SplashAwareDupeFilter shows that it inherits from scrapy.dupefilter.RFPDupeFilter and overrides its request_fingerprint() method. Comparing request_fingerprint() in scrapy.dupefilter.RFPDupeFilter with the one in scrapy_redis.dupefilter.RFPDupeFilter shows they are identical, so SplashAwareDupeFilter was rewritten to inherit from scrapy_redis.dupefilter.RFPDupeFilter instead, with the rest of the code left unchanged.
# -*- coding: utf-8 -*-
"""
To handle "splash" Request meta key properly a custom DupeFilter must be set.
See https://github.com/scrapy/scrapy/issues/900 for more info.
"""
from __future__ import absolute_import
from copy import deepcopy

from scrapy.utils.request import request_fingerprint
from scrapy.utils.url import canonicalize_url

from scrapy_splash.utils import dict_hash

from scrapy_redis.dupefilter import RFPDupeFilter


def splash_request_fingerprint(request, include_headers=None):
    """ Request fingerprint which takes 'splash' meta key into account """
    fp = request_fingerprint(request, include_headers=include_headers)
    if 'splash' not in request.meta:
        return fp

    splash_options = deepcopy(request.meta['splash'])
    args = splash_options.setdefault('args', {})

    if 'url' in args:
        args['url'] = canonicalize_url(args['url'], keep_fragments=True)

    return dict_hash(splash_options, fp)


class SplashAwareDupeFilter(RFPDupeFilter):
    """
    DupeFilter that takes 'splash' meta key in account.
    It should be used with SplashMiddleware.
    """
    def request_fingerprint(self, request):
        return splash_request_fingerprint(request)
Finally, MyRedisSpider has to be updated so that the DUPEFILTER_CLASS entry in redis_settings points to the class above.
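For example, assuming the class above is saved in a module called dupefilter.py inside the project package (the module path below is hypothetical), the entry in redis_settings becomes:

# Inside MyRedisSpider.update_settings: point the duplicate filter at the new class.
# "myproject.dupefilter" is a hypothetical module path; adjust it to the real project layout.
"DUPEFILTER_CLASS": "myproject.dupefilter.SplashAwareDupeFilter",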
Note: whether this approach actually works has not been verified yet.