【scrapy_redis】Simple Distributed Crawler (Part 2)

scrapy version: 1.5.1
scrapy-redis version: 0.6.8
redis version: 2.10.6
scrapy_redis on GitHub: https://github.com/rmax/scrapy-redis
This project on GitHub: https://github.com/MsLpoi/sr_demo

  Building on the first article in this series, let's continue and write the slave spider~


1. Getting started

  This continues from the sr_demo project built in the previous article, 【scrapy_redis】Simple Distributed Crawler (Part 1).

1. settings.py

  1. Append the following code to the end of the file (this is the settings snippet from the scrapy_redis GitHub README, with its ITEM_PIPELINES entry removed):

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
REDIS_URL = 'redis://localhost:6379'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS  = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start urls list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'

  There are a lot of settings here, but for now we only need to care about REDIS_HOST, REDIS_PORT and REDIS_URL.
  1) REDIS_HOST and REDIS_PORT: these two are used together and set the host and port of the Redis database to connect to.
  2) REDIS_URL: this also configures the Redis connection, but besides the host and port it can carry other options (password, database number, and so on); it is worth reading up on and adjusting to your needs.
  Note: the two forms do not take effect at the same time. The priority is REDIS_URL > (REDIS_HOST, REDIS_PORT), so setting REDIS_URL directly is recommended; see the sketch below for an example.
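
  For instance, a settings.py pointing at a password-protected Redis instance on another machine might look like the following sketch; the host, password and database number are made-up values for illustration only:

# settings.py -- hypothetical connection values, adjust to your own environment
# URL format: redis://[:password]@host:port/db
REDIS_URL = 'redis://:my_redis_password@192.168.1.100:6379/0'

# Equivalent host/port form (ignored whenever REDIS_URL is set):
# REDIS_HOST = '192.168.1.100'
# REDIS_PORT = 6379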

2. read_urls.py

  Inside the sr_demo project, create a new spider with the following command:

scrapy genspider -t basic read_urls news.stcn.com

  and put the following code in the generated file:

# -*- coding: utf-8 -*-

from scrapy_redis.spiders import RedisSpider


class ReadUrlsSpider(RedisSpider):
    name = 'read_urls'
    redis_key = 'read_urls:start_urls'
    # Enable the scrapy_redis item pipeline for this spider only,
    # so scraped items are pushed back into Redis.
    custom_settings = {
        'ITEM_PIPELINES': {
            'scrapy_redis.pipelines.RedisPipeline': 300,
        }
    }

    def parse(self, response):
        url = response.url
        # The article title sits under div.intal_tit > h2 on news.stcn.com pages
        title = response.xpath('//div[@class="intal_tit"]/h2/text()').extract_first()
        return {'url': url, 'title': title}

  • The spider inherits from RedisSpider (there is also a RedisCrawlSpider class; I simply did not use it here).
  • redis_key: the spider fetches the links to crawl from the list stored under this key in Redis, so we just point it at the read_urls:start_urls list obtained earlier. A sketch of how links can be pushed into that list by hand is shown below.
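
  If read_urls:start_urls is empty (for example because the get_urls spider from Part 1 has not been run), you can seed it manually. Below is a minimal sketch using the redis Python client, assuming a local Redis on the default port; the article URL is made up and should be replaced with a real news.stcn.com link:

# seed_start_urls.py -- minimal sketch for seeding the start-URL list by hand
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# RedisSpider pops links from the list named by its redis_key
r.lpush('read_urls:start_urls', 'http://news.stcn.com/example/article.html')

print(r.llen('read_urls:start_urls'))  # number of pending start URLs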

3. Run

  Run the read_urls spider and then check the Redis database; if everything looks right, it worked~ (A quick way to inspect the results is sketched after the command below.)

scrapy crawl read_urls
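
  With RedisPipeline enabled, scraped items are serialized to JSON and stored in the read_urls:items list (the default %(spider)s:items key shown in the settings above). A quick sketch for peeking at them, assuming the same local Redis instance:

# check_items.py -- quick sketch for inspecting scraped items in Redis
import json

import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# RedisPipeline appends each item to the <spider>:items list as a JSON string
for raw in r.lrange('read_urls:items', 0, 9):  # first 10 items
    print(json.loads(raw.decode('utf-8')))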

Things to think about:

  1. Set SCHEDULER_PERSIST to False in settings.py, run the get_urls and read_urls spiders again, and watch how the contents of the Redis database change (the sketch after this list shows one way to watch them).
  2. Once all the links have been crawled, the spider keeps running idle (it is actually waiting for new links to appear in the read_urls:start_urls list). At that point run the read_urls spider one more time and observe the output.
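
  While experimenting, a small hypothetical helper like the one below can be used to watch the keys scrapy_redis creates for this spider, again assuming the same local Redis instance:

# watch_keys.py -- hypothetical helper for watching the spider's Redis keys
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

for key in sorted(r.keys('read_urls:*')):
    name = key.decode('utf-8')
    kind = r.type(name).decode('utf-8')
    if kind == 'list':
        print(name, 'list', r.llen(name))
    elif kind == 'zset':
        print(name, 'zset', r.zcard(name))
    elif kind == 'set':
        print(name, 'set', r.scard(name))
    else:
        print(name, kind)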
