A Hands-On Scrapy-Redis Distributed Crawler Example [Summary of the Key Details]

How Scrapy-Redis works:

[Figure 1: how Scrapy-Redis works]

[Figure 2: how Scrapy-Redis works]
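
In short, scrapy-redis moves Scrapy's scheduler queue and duplicate filter out of process memory and into Redis, so any number of Scrapy processes on any number of machines share one request queue and one fingerprint set. The sketch below only illustrates that idea with plain redis-py (it is not the actual scrapy-redis code); the demo:* key names are made up, and it assumes a Redis server on localhost:

# Conceptual sketch of a shared request queue plus duplicate filter (not real scrapy-redis code).
import redis

r = redis.Redis(host="localhost", port=6379)


def push_request(url):
    # Every worker pushes newly discovered URLs into one shared queue,
    # but only if the URL has not been seen before (sadd returns 1 only for new members).
    if r.sadd("demo:dupefilter", url):
        r.lpush("demo:requests", url)


def pop_request():
    # Any worker, on any machine, can pop the next request to download.
    raw = r.rpop("demo:requests")
    return raw.decode() if raw else None


push_request("https://yuedu.baidu.com/rank/hotsale")
print(pop_request())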

 

A hands-on example [only a few details differ from an ordinary Scrapy spider]:

Step 1: configure settings.py [this part is critical; it makes or breaks the distributed crawl]

The official documentation is the most complete reference for these settings: https://pypi.org/project/scrapy-redis/

The configuration for my crawler project [scrapy_distributed]. [Make sure your Redis server is running first, otherwise the connection will fail.]

// settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for scrapy_distributed project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'scrapy_distributed'

SPIDER_MODULES = ['scrapy_distributed.spiders']
NEWSPIDER_MODULE = 'scrapy_distributed.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Whichever site you crawl, impersonate a client it accepts; e.g. Baidu Yuedu allows Baidu's own spider.
# The first entry in https://yuedu.baidu.com/robots.txt is "User-agent: Baiduspider".
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure item pipelines
# Store scraped item in redis for post-processing.
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapy_distributed.pipelines.BookPipeline': 100,
    'scrapy_redis.pipelines.RedisPipeline': 300
}
# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Specify the host and port to use when connecting to Redis (optional).
REDIS_HOST = 'localhost'
REDIS_PORT = 6379
# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
# REDIS_URL = 'redis://user:pass@hostname:9001'
REDIS_URL = 'redis://localhost:6379'
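
Before launching anything, it is worth checking that the Redis server named in REDIS_URL is actually reachable; an unreachable Redis is the most common reason the distributed crawl fails to start. A minimal check, assuming the redis-py package (it is installed as a dependency of scrapy-redis):

import redis

r = redis.from_url("redis://localhost:6379")
try:
    r.ping()  # raises ConnectionError if the Redis server is not running
    print("Redis is reachable, the crawl can start.")
except redis.exceptions.ConnectionError as exc:
    print("Cannot reach Redis:", exc)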

 

// myspider.py (the spider)

from scrapy import Request
from scrapy_distributed.items import BookItem
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisSpider


class BookSpider(RedisSpider):
    # The name attribute is the spider_name used in [scrapy crawl spider_name]; it must be unique.
    name = 'redis'
    allowed_domains = ['yuedu.baidu.com']
    # One of the biggest differences in a distributed crawler: start_urls is not hard-coded here.
    # The start URL is pushed into Redis by hand and every spider process reads it from there;
    # the data scraped by all spiders [on multiple machines] is then aggregated in the same Redis database.
    # start_urls = ['https://yuedu.baidu.com/rank/hotsale']

    def __init__(self, name=None, **kwargs):
        super().__init__(name=name, **kwargs)
        self.title = ""
        self.price = ""
        self.tags = ""
        self.author = ""
        self.copyright = ""
        self.score = ""

    # Spider step 1: collect the links from the page and yield a Request for each one (parse is a generator).
    def parse(self, response):

        url = response.urljoin(response.xpath("//div[@class='pager-inner']/a/@href").get())
        if url:
            yield Request(url=url, callback=self.parse)

        extractor = LinkExtractor(restrict_css=".book .al.title-link")
        for link in extractor.extract_links(response):
            yield Request(url=link.url, callback=self.parse_book)

    # Spider step 2: send Requests for the extracted links and process each response in its own callback.
    def parse_book(self, response):
        for book in response.css(".doc-info-bd.clearfix .content-block"):
            item = BookItem()
            item["title"] = book.xpath(".//h1[@class='book-title']/text()").get()
            item["price"] = book.xpath(".//span[@class='numeric']/text()").get()
            item["author"] = book.xpath(".//a[@class='doc-info-field-val doc-info-author-link']/text()").get()
            item["copyright"] = book.xpath(".//a[@class='doc-info-field-val']/text()").get()
            item["tags"] = [value.get() for value in book.xpath(".//a[@class='tag-item doc-info-field-val mb5']/text()")]
            item["score"] = book.xpath(".//span[@class='doc-info-read-count']/text()").get()
            yield item
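
A RedisSpider reads its start URLs from a Redis list whose key defaults to <name>:start_urls, which is why this spider is fed through redis:start_urls in the commands at the end. If you prefer an explicit key, set redis_key on the class; the sketch below is only an illustration (books:start_urls is a made-up key, not part of the project above):

from scrapy_redis.spiders import RedisSpider


class ExplicitKeySpider(RedisSpider):
    name = 'redis_explicit'
    # Overrides the "<name>:start_urls" default; seed it with:
    #   redis-cli lpush books:start_urls <url>
    redis_key = 'books:start_urls'
    allowed_domains = ['yuedu.baidu.com']

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)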



// items.py

from scrapy import Item, Field


class BookItem(Item):
    title = Field()
    price = Field()
    tags = Field()
    author = Field()
    copyright = Field()
    score = Field()

// pipelines.py

import json

from scrapy.exceptions import DropItem


class BookPipeline:

    # from_crawler returns an instance initialized with values read from the settings file.
    def __init__(self, count):
        self.count = count

    # Every item passes through this method; it is called once per scraped item.
    # To filter out an item, raise DropItem here when your filter condition matches.
    def process_item(self, item, spider):
        # Strip stray whitespace and newlines from the title (it may be missing, hence the guard).
        if item.get("title"):
            item["title"] = item["title"].replace("\n", "").strip()
        # Append the cleaned item to a local JSON-lines file as well.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            # Read a custom value from settings.py (see the BOOK_FILTER_COUNT note after this file).
            count=crawler.settings.get("BOOK_FILTER_COUNT")
        )

    def open_spider(self, spider):
        # items.jl is opened when the spider starts and closed in close_spider below.
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()
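
The from_crawler hook above reads BOOK_FILTER_COUNT from the settings, but that setting is not defined in the settings.py shown earlier, so count would simply be None. If you intend to use it as a filter threshold in process_item, add it to settings.py; the value below is only an example:

// settings.py (addition)

BOOK_FILTER_COUNT = 10   # example threshold for filtering; pick whatever your logic needs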

 

To start the distributed crawl, run the following two commands:


// Run the spider on several machines at the same time [they must all connect to the same Redis database]
scrapy crawl redis

// Seed start_urls for your spider [name is the attribute of the spider class], i.e. push the starting URL into redis:start_urls
redis-cli lpush redis:start_urls https://yuedu.baidu.com/rank/hotsale
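
Because RedisPipeline is enabled, every scraped item is also serialized to JSON and pushed to a Redis list, named <spider name>:items by default (so redis:items here). A quick way to inspect the aggregated results, again assuming redis-py:

import json

import redis

r = redis.from_url("redis://localhost:6379")
for raw in r.lrange("redis:items", 0, 9):  # the first 10 items pushed by RedisPipeline
    print(json.loads(raw))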

[Figure 3]
