Distributed crawling of a movie site with scrapy_redis (resumable downloads + download progress bar)

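1. Background

Operating system and environment

Operating system: Win10 (master), Ubuntu (worker)
Python version: Python 3.6
Scrapy version: Scrapy 1.5.1
scrapy_redis: must be installed on both machines
redis: the redis instance on the master must allow remote connections

The goal here is only to share how to do a simple distributed crawl, so I picked a site with a fairly simple structure (the address is not suitable for sharing publicly; for learning purposes only).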

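2. Code

Main idea
Use the scrapy_redis framework to crawl the site in a distributed way. The work is split into the following steps:
1. The first spider collects the URLs to download and pushes them onto a queue in redis (this spider only needs to exist on the master). The workers read the URLs they should crawl from that redis queue.
2. The second spider extracts each movie's information and hands it to the pipelines for persistent storage.
3. The movie download supports resuming after an interruption and shows a progress bar.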
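The hand-off in step 1 can be sketched without a running redis server. The snippet below is only an illustration: `InMemoryList` is a hypothetical stand-in that mimics the `lpush`/`lpop` calls of a redis client for a single list, and the example.com URLs are made up. It assumes the worker pops from the same key the master pushes to.

```python
from collections import deque

QUEUE_KEY = "video6969:start_urls"  # the key used by the spiders in this post


class InMemoryList:
    """Hypothetical stand-in for one redis list, so the sketch runs without a server."""

    def __init__(self):
        self._items = deque()

    def lpush(self, key, value):
        # Like redis LPUSH: insert at the head of the list.
        self._items.appendleft(value)
        return len(self._items)

    def lpop(self, key):
        # Like redis LPOP: remove and return the head, or None when the list is empty.
        return self._items.popleft() if self._items else None


def publish(client, urls):
    """Master side: push every discovered detail-page URL onto the shared list."""
    for url in urls:
        client.lpush(QUEUE_KEY, url)


def consume(client):
    """Worker side: pop URLs until the list is empty."""
    while True:
        url = client.lpop(QUEUE_KEY)
        if url is None:
            break
        yield url


queue = InMemoryList()
publish(queue, ["https://example.com/vod/1/a.html", "https://example.com/vod/2/b.html"])
fetched = list(consume(queue))
print(fetched)
```

With a real deployment the `client` argument would be a redis connection pointing at the master, and several workers could pop from the list at the same time.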
Project layout

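  - crawlall.py: starts multiple spiders
  - crawl_url.py: collects URLs and saves them to the redis queue
  - video_6969.py: crawls the movies
  - items.py: defines the movie fields
  - pipelines.py: downloads movies, resumes interrupted downloads, shows download progress, saves to redis
  - settings.py: configuration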

First, configure settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for Video_6969 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Video_6969'

SPIDER_MODULES = ['Video_6969.spiders']
NEWSPIDER_MODULE = 'Video_6969.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Video_6969 (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 150

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 200
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Video_6969.middlewares.Video6969SpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Video_6969.middlewares.Video6969DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
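   # In a distributed crawl the data does not have to go through a local pipeline (nothing needs to be stored locally). Items are stored in redis, so the redis pipeline component is added here
   'Video_6969.pipelines.Video6969Pipeline': 300,
   "scrapy_redis.pipelines.RedisPipeline": 100,  # item data is saved to redis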
   "Video_6969.pipelines.CrawlUrls": 50,
   # 'Video_6969.pipelines.Video6969Info': 200,
}


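# Redis connection settings
# Redis host address
REDIS_HOST = '10.36.133.11'  # host
REDIS_PORT = 6379  # port
# REDIS_PARAMS = {"password": "xxxx"}  # password


# Switch the scheduler to the scrapy_redis scheduler (a rewrite of the native scheduler by the scrapy_redis component, adding distributed scheduling)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Use the scrapy_redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and dupefilter in redis when the spider stops, so a crawl can be paused and resumed
SCHEDULER_PERSIST = True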


# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# Logging
# Disable logging or adjust the log level
# LOG_ENABLED = False
# LOG_LEVEL = 'ERROR'


LOG_LEVEL = 'DEBUG'
"""
CRITICAL - critical errors
ERROR - regular errors
WARNING - warnings
INFO - informational messages
DEBUG - debugging messages
"""

# Log file
LOG_FILE = '6969.log'

# Whether logging is enabled
LOG_ENABLED = True  # (True by default, logging enabled)

# If True, all standard output (and errors) of the process is redirected to the log
LOG_STDOUT = False

# Log encoding
LOG_ENCODING = 'utf-8'


# Module that holds the custom command for starting all spiders
COMMANDS_MODULE = 'Video_6969.commands'

# MongoDB settings
MONGO_HOST = "127.0.0.1"  # host IP
MONGO_PORT = 27017  # port
MONGO_DB = "6969"  # database name
MONGO_COLL = "ViodeInfo"  # collection name
# If a username and password are required
# MONGO_USER = "zhangsan"
# MONGO_PSW = "123456"

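Note: the spider now has to inherit from RedisCrawlSpider, and its URLs are read from redis under the key configured in redis_key, so start_urls must be commented out. We will set up the starting URLs in redis later.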

crawl_url.py

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from Video_6969.items import UrlItem


class Video6969(CrawlSpider):
    name = 'crawl_urls'
    start_urls = ['https://www.6969qq.com']
    rules = (
        Rule(LinkExtractor(allow=r'/html/\d+/'), follow=True),  # category pages
        Rule(LinkExtractor(allow=r'/vod/\d+/.+?html'), callback='video_info', follow=True),  # "more" links (movie pages)
    )

    def video_info(self, response):
        item = UrlItem()
        item['html_url'] = response.url
        yield item

crawl_url.py collects the URLs of the pages we need to download, and the pipelines then store them in the redis queue. (You could also do the persistence directly in crawl_url.)
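To see which links the two rules would pick up, the `allow` patterns can be tried on their own with the standard `re` module. This is a small sketch, not part of the project: the example.com URLs are hypothetical, and `classify` is a helper made up for illustration that checks the patterns in the same order as the rules.

```python
import re

CATEGORY = re.compile(r'/html/\d+/')      # pattern of the first rule (category pages)
DETAIL = re.compile(r'/vod/\d+/.+?html')  # pattern of the second rule (movie pages)


def classify(url):
    """Return which rule a URL would fall under, checking the rules in order."""
    if CATEGORY.search(url):
        return "category"
    if DETAIL.search(url):
        return "detail"
    return None


# Hypothetical URLs, for illustration only.
samples = [
    "https://example.com/html/12/",
    "https://example.com/vod/345/index.html",
    "https://example.com/about.html",
]
for url in samples:
    print(url, "->", classify(url))
```

Only URLs that match the second pattern reach the `video_info` callback and end up in the redis queue.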

video_6969.py

# -*- coding: utf-8 -*-

from scrapy_redis.spiders import RedisCrawlSpider
from Video_6969.items import Video6969Item


class Video6969(RedisCrawlSpider):
    name = 'video_6969'

    redis_key = "video6969:start_urls"

    def parse(self, response):
        item = Video6969Item()
        item['html_url'] = response.url
        item['name'] = response.xpath("//h1/text()").extract_first()
        item['video_type'] = response.xpath("//div[@class = 'play_nav hidden-xs']//a/@title").extract_first()
        # Raw string with escaped dots, so "." only matches a literal dot in the host name
        item['video_url'] = response.selector.re(r"(https://\w+\.xia12345\.com/.+?mp4)")[0]
        yield item

The other workers do not need crawl_url.py; they use this file to extract the movie information for downloading.
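The video address is pulled out of the page source with a regular expression. A minimal sketch of that step, using only `re`: the page fragment is invented for illustration, and `find_video_url` is a hypothetical helper that returns None instead of raising when nothing matches.

```python
import re

# Dots are escaped so they match a literal "." rather than any character.
VIDEO_URL = re.compile(r"https://\w+\.xia12345\.com/.+?mp4")


def find_video_url(html):
    """Return the first matching mp4 URL in the page source, or None."""
    match = VIDEO_URL.search(html)
    return match.group(0) if match else None


# Hypothetical page fragment, for illustration only.
page = '<script>var player = {url: "https://cdn1.xia12345.com/2018/abc.mp4"};</script>'
print(find_video_url(page))
print(find_video_url("<html>no video here</html>"))
```

Returning None for pages without a video makes it easy for the caller to skip them instead of failing on an empty match list.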

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Video6969Item(scrapy.Item):
    video_type = scrapy.Field()
    name = scrapy.Field()
    html_url = scrapy.Field()
    video_url = scrapy.Field()


class UrlItem(scrapy.Item):
    html_url = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import pymongo
import redis
import requests
import sys


from Video_6969.items import UrlItem, Video6969Item


# Movie download
class Video6969Pipeline(object):
    dir_path = r'G:\Video_6969'

    def process_item(self, item, spider):
        if not isinstance(item, Video6969Item):
            return item  # let other item types pass through unchanged

        type_path = os.path.join(self.dir_path, item['video_type'])
        if not os.path.exists(type_path):
            os.makedirs(type_path)
        # One file per movie: <dir_path>/<video_type>/<name>.mp4
        path = os.path.join(type_path, item['name'] + ".mp4")

        try:
            headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.3.2.1000 Chrome/30.0.1599.101 Safari/537.36"
            }
            # Keep requesting until the whole file is on disk
            while True:
                # If the file already exists, resume from its current size
                if os.path.exists(path):
                    now_length = os.path.getsize(path)
                    print("Resuming after a network interruption. Already downloaded: {}MB".format(now_length // 1024 // 1024))
                    headers['Range'] = 'bytes=%d-' % now_length  # byte offset to resume from
                else:
                    now_length = 0  # bytes downloaded so far
                # stream=True defers the body download until iter_content() is consumed
                res = requests.get(item['video_url'], stream=True, headers=headers)
                if res.status_code == 416:
                    # The requested range starts at the end of the file: nothing left to fetch
                    break
                res.raise_for_status()
                if res.status_code == 206:
                    # Partial response: Content-Length covers only the remaining bytes
                    total_length = now_length + int(res.headers['Content-Length'])
                    mode = 'ab'
                else:
                    # The server ignored Range and is sending the whole file
                    total_length = int(res.headers['Content-Length'])
                    if now_length >= total_length:
                        break  # the local copy is already complete
                    now_length, mode = 0, 'wb'
                print("Preparing to download: 【{}】{} {}MB".format(item["video_type"], item["name"], total_length // 1024 // 1024))

                # Write the received video data
                with open(path, mode) as file:
                    for chunk in res.iter_content(chunk_size=1024):
                        file.write(chunk)
                        now_length += len(chunk)
                        # Flush so the data reaches the file chunk by chunk
                        file.flush()
                        # Show download progress
                        done = int(50 * now_length / total_length)
                        sys.stdout.write(
                            "\r【%s%s】%d%%" % ('█' * done, ' ' * (50 - done), 100 * now_length / total_length))
                        sys.stdout.flush()
                print()
                if now_length >= total_length:
                    break

        except Exception as e:
            print(e)
            raise IOError(str(e))

        print("【{}】{} download finished: {}MB".format(item["video_type"], item["name"], now_length // 1024 // 1024))
        return item


# Store in MongoDB
class Video6969Info(object):

    def __init__(self, mongo_host, mongo_db, mongo_coll):
        self.mongo_host = mongo_host
        self.mongo_db = mongo_db
        self.mongo_coll = mongo_coll
        self.count = 0


    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_host=crawler.settings['MONGO_HOST'],
            mongo_db=crawler.settings['MONGO_DB'],
            mongo_coll=crawler.settings['MONGO_COLL']
        )

    def open_spider(self, spider):
        # Connect to the database
        self.client = pymongo.MongoClient(self.mongo_host)
        self.db = self.client[self.mongo_db]  # database handle
        self.coll = self.db[self.mongo_coll]  # collection handle

    def close_spider(self, spider):
        self.client.close()  # close the connection

    def process_item(self, item, spider):
        data = dict(item)  # convert the item to a dict
        try:
            # insert_one() replaces insert(), which is deprecated in pymongo 3 and removed in pymongo 4
            self.coll.insert_one(data)
            self.count += 1
        except Exception as e:
            raise IOError(str(e))
        if not self.count % 100:
            print("Items stored so far: %d" % self.count)
        return item


# Push onto the Redis queue
class CrawlUrls(object):
    def open_spider(self, spider):
        # Connect once per spider instead of once per item
        self.rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)

    def process_item(self, item, spider):
        if isinstance(item, UrlItem):
            self.rds.lpush("video6969:start_urls", item['html_url'])
        return item

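Here the requests library fetches the movie's binary data and writes it to disk. Network fluctuations can easily leave the video file corrupted, so the download is made resumable as well.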
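The arithmetic behind resuming and the progress bar is easy to get wrong, so here is a small standalone sketch of it. All three helpers are hypothetical and not part of the project. The sketch assumes a server that answers a Range request with status 206 and a Content-Range header, in which case Content-Length holds only the remaining bytes rather than the full size. Plain dicts stand in for response headers.

```python
import re


def range_header(bytes_on_disk):
    """Build the Range header that asks for everything after what is already saved."""
    return {"Range": "bytes=%d-" % bytes_on_disk} if bytes_on_disk > 0 else {}


def total_size(status_code, headers, bytes_on_disk):
    """Work out the full file size from the response to a (possibly ranged) request."""
    if status_code == 206:
        # Partial content: Content-Range looks like "bytes 1024-4095/4096".
        match = re.match(r"bytes \d+-\d+/(\d+)", headers.get("Content-Range", ""))
        if match:
            return int(match.group(1))
        # Fall back to "already saved + remaining".
        return bytes_on_disk + int(headers["Content-Length"])
    # 200 means the server ignored Range and is sending the whole file again.
    return int(headers["Content-Length"])


def progress_bar(done_bytes, total_bytes, width=50):
    """Render a one-line text progress bar."""
    filled = int(width * done_bytes / total_bytes)
    percent = 100 * done_bytes // total_bytes
    return "[%s%s] %d%%" % ("#" * filled, " " * (width - filled), percent)


print(range_header(1024))
print(total_size(206, {"Content-Range": "bytes 1024-4095/4096", "Content-Length": "3072"}, 1024))
print(progress_bar(1024, 4096, width=20))
```

Computing the percentage against the full size, not the Content-Length of the partial response, keeps the bar from jumping past 100% after a resume.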

crawlall.py

from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # spider_loader replaces the deprecated crawler_process.spiders attribute
        spider_list = self.crawler_process.spider_loader.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()

Run

scrapy crawlall

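After startup, the crawl_url spider crawls URLs into the redis queue, and the workers start downloading as soon as they receive URLs. Of course, you can also distribute the crawl in other ways.
Note: when saving movies, make sure you have read and write permission on the target directory.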
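One way to catch a permission problem before the first download starts is a quick check like the one below. It is only a sketch: `ensure_writable_dir` is a hypothetical helper, and a temporary directory stands in for the real download folder.

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create the directory if needed and report whether files can be written into it."""
    os.makedirs(path, exist_ok=True)
    return os.access(path, os.W_OK | os.X_OK)


# A temporary directory stands in for the real download folder.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "Video_6969", "some_category")
    writable = ensure_writable_dir(target)
    print(writable)
```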
