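I. Background
- Operating system and environment
Operating systems: Win10 (master), Ubuntu (slave)
Python version: Python 3.6
Scrapy version: Scrapy 1.5.1
scrapy_redis: must be installed on both machines
Redis database: the Redis instance on the master must allow remote connections
This post only aims to show how to run a simple distributed crawl, so I picked a site with a fairly simple structure (the URL is not suitable for publication; for learning purposes only).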
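II. Code
Main idea
The site is crawled in a distributed fashion with the scrapy_redis framework. The work breaks down into the following steps:
1. The first spider collects the URLs of the pages to download and pushes them onto a queue in the Redis database (this spider only needs to live on the master). The slaves pull the URLs they should crawl from that Redis queue.
2. The second spider extracts the movie information and hands it to the pipelines for persistent storage.
3. While a movie is downloading, the pipeline supports resumable downloads and displays a progress bar.
Project directory structure
- crawlall.py: starts multiple spiders
- crawl_url.py: collects URLs and saves them to the Redis queue
- video_6969.py: crawls the movies
- items.py: defines the movie fields
- pipelines.py: downloads movies, resumes interrupted downloads, shows the download progress bar, saves to the Redis database
- settings.py: configuration
- First, configure settings.py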
# -*- coding: utf-8 -*-
# Scrapy settings for Video_6969 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Video_6969'
SPIDER_MODULES = ['Video_6969.spiders']
NEWSPIDER_MODULE = 'Video_6969.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Video_6969 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 150
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 200
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Video_6969.middlewares.Video6969SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Video_6969.middlewares.Video6969DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
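ITEM_PIPELINES = {
    # In a distributed crawl the items need not pass through a local pipeline (the data does not have to be stored locally).
    # The items are written to the Redis database, so a Redis pipeline component is added here.
    'Video_6969.pipelines.Video6969Pipeline': 300,
    "scrapy_redis.pipelines.RedisPipeline": 100,  # items are saved to Redis
    "Video_6969.pipelines.CrawlUrls": 50,
    # 'Video_6969.pipelines.Video6969Info': 200,
}
# Redis configuration
# Redis host address
REDIS_HOST = '10.36.133.11'  # host
REDIS_PORT = 6379  # port
# REDIS_PARAMS = {"password": "xxxx"}  # password
# Switch to the scrapy_redis scheduler (a rewrite of Scrapy's built-in scheduler by the scrapy_redis component, adding distributed scheduling)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Use the scrapy_redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Whether the crawl may be paused and resumed (the Redis queues are kept instead of being cleared)
SCHEDULER_PERSIST = True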
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
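# Logging
# Disable logging or adjust the log level
# LOG_ENABLED = False
# LOG_LEVEL = 'ERROR'
LOG_LEVEL = 'DEBUG'
"""
CRITICAL - critical errors
ERROR - regular errors
WARNING - warning messages
INFO - informational messages
DEBUG - debugging messages
"""
# Log file
LOG_FILE = '6969.log'
# Whether logging is enabled (it is on by default, so configuring a log file needs no extra switch)
LOG_ENABLED = True  # (defaults to True, logging enabled)
# If True, all standard output (including errors) of the process is redirected to the log
LOG_STDOUT = False
# Log encoding
LOG_ENCODING = 'utf-8'
# Module that holds the custom command for starting all spiders
COMMANDS_MODULE = 'Video_6969.commands'
# MongoDB configuration
MONGO_HOST = "127.0.0.1"  # host IP
MONGO_PORT = 27017  # port
MONGO_DB = "6969"  # database name
MONGO_COLL = "ViodeInfo"  # collection name
# If a username and password are required
# MONGO_USER = "zhangsan"
# MONGO_PSW = "123456"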
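Note: the spider now has to inherit from RedisCrawlSpider, and its URLs are fetched from the Redis database under the key configured in redis_key, so start_urls must be commented out. The start URLs are placed in Redis later.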
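To make the role of redis_key more concrete, here is a minimal sketch of seeding that list by hand. `InMemoryList` and `seed_start_urls` are hypothetical names introduced only for illustration, and the example.com URLs are placeholders. `InMemoryList` is a stand-in that mimics just `lpush`, `lpop` and `llen`, so the sketch runs without a server; against a real Redis you would presumably pass a `redis.StrictRedis(host=REDIS_HOST, port=REDIS_PORT)` client instead.

```python
class InMemoryList:
    """Hypothetical stand-in for a Redis client; only lpush/lpop/llen are mimicked."""

    def __init__(self):
        self._lists = {}

    def lpush(self, key, *values):
        items = self._lists.setdefault(key, [])
        for value in values:
            items.insert(0, value)  # LPUSH adds each value at the head of the list
        return len(items)

    def lpop(self, key):
        items = self._lists.get(key) or []
        return items.pop(0) if items else None

    def llen(self, key):
        return len(self._lists.get(key, []))


REDIS_KEY = "video6969:start_urls"  # must match redis_key in the spider


def seed_start_urls(client, urls, key=REDIS_KEY):
    """Push every URL onto the list the spider listens on; return the new length."""
    length = client.llen(key)
    for url in urls:
        length = client.lpush(key, url)
    return length


client = InMemoryList()
queued = seed_start_urls(client, [
    "https://example.com/vod/1/a.html",  # placeholder URLs, not from the real site
    "https://example.com/vod/2/b.html",
])
print(queued)
```

Any machine that can reach the master's Redis can seed the list this way; the spiders only care that URLs appear under the key named in redis_key.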
- crawl_url.py
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from Video_6969.items import UrlItem


class Video6969(CrawlSpider):
    name = 'crawl_urls'
    start_urls = ['https://www.6969qq.com']
    rules = (
        Rule(LinkExtractor(allow=r'/html/\d+/'), follow=True),  # category pages
        Rule(LinkExtractor(allow=r'/vod/\d+/.+?html'), callback='video_info', follow=True),  # "more" (detail) pages
    )

    def video_info(self, response):
        # Emit the page URL; the CrawlUrls pipeline pushes it onto the Redis queue
        item = UrlItem()
        item['html_url'] = response.url
        yield item
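crawl_url.py collects the URLs of the pages we want to download and stores them in the Redis queue through the pipelines. (The persistence could also be done directly inside crawl_url.py.)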
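As a rough illustration of which links the two rules pick up, the sketch below applies the same two patterns with the standard `re` module. The `classify` helper and the example.com URLs are hypothetical; in the real spider the matching is done by LinkExtractor, and this only approximates it by searching each URL in rule order.

```python
import re

# The same patterns as in the spider's rules, checked in the same order
CATEGORY_RULE = re.compile(r'/html/\d+/')      # followed, no callback
DETAIL_RULE = re.compile(r'/vod/\d+/.+?html')  # followed and passed to video_info


def classify(url):
    """Return which rule (if any) would pick up this URL."""
    if CATEGORY_RULE.search(url):
        return "category"
    if DETAIL_RULE.search(url):
        return "detail"
    return "ignored"


# Placeholder URLs, not taken from the real site
for url in ("https://example.com/html/3/",
            "https://example.com/vod/12/play.html",
            "https://example.com/about"):
    print(url, "->", classify(url))
```

Only URLs classified as "detail" end up as UrlItem objects and therefore in the Redis queue.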
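- video_6969.py
# -*- coding: utf-8 -*-
from scrapy_redis.spiders import RedisCrawlSpider
from Video_6969.items import Video6969Item


class Video6969(RedisCrawlSpider):
    name = 'video_6969'
    # Redis list from which this spider reads its start URLs
    redis_key = "video6969:start_urls"

    def parse(self, response):
        # No rules are defined, so every URL taken from Redis is handled here
        item = Video6969Item()
        item['html_url'] = response.url
        item['name'] = response.xpath("//h1/text()").extract_first()
        item['video_type'] = response.xpath("//div[@class = 'play_nav hidden-xs']//a/@title").extract_first()
        # Raw string with escaped dots, so "." only matches a literal dot
        video_urls = response.selector.re(r"(https://\w+\.xia12345\.com/.+?mp4)")
        if not video_urls:
            return  # no video link on this page, nothing to yield
        item['video_url'] = video_urls[0]
        yield item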
The slave machines do not need crawl_url.py; they rely on video_6969.py to extract the movie information and download the movies.
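The video link is pulled out of the page source with a regular expression. The following sketch shows the same kind of extraction with the standard `re` module; `first_video_url` and the sample page fragment are made up for illustration, and only the domain pattern comes from the spider above.

```python
import re

# Dots are escaped so they match a literal "." rather than any character
VIDEO_URL = re.compile(r"https://\w+\.xia12345\.com/.+?mp4")


def first_video_url(page_source):
    """Return the first matching video URL, or None when the page has none."""
    match = VIDEO_URL.search(page_source)
    return match.group(0) if match else None


# Hypothetical page fragment, not copied from the real site
sample = 'var player = {"url": "https://cdn1.xia12345.com/films/demo.mp4", "auto": 1};'
print(first_video_url(sample))
print(first_video_url("<html></html>"))
```

Returning None instead of indexing `[0]` blindly lets the caller skip pages that carry no video link.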
- items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class Video6969Item(scrapy.Item):
    video_type = scrapy.Field()
    name = scrapy.Field()
    html_url = scrapy.Field()
    video_url = scrapy.Field()


class UrlItem(scrapy.Item):
    html_url = scrapy.Field()
- pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import pymongo
import redis
import requests
import sys
from Video_6969.items import UrlItem, Video6969Item
# Movie download
class Video6969Pipeline(object):
    dir_path = r'G:\Video_6969'

    def process_item(self, item, spider):
        if isinstance(item, Video6969Item):
            type_path = os.path.join(self.dir_path, item['video_type'])
            if not os.path.exists(type_path):
                os.makedirs(type_path)
            # Save as <dir_path>/<video_type>/<name>.mp4
            path = os.path.join(type_path, item['name'] + ".mp4")
            now_length = 0  # bytes downloaded so far
            try:
                headers = {
                    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.3.2.1000 Chrome/30.0.1599.101 Safari/537.36"
                }
                # Keep requesting until the whole video has been received
                while True:
                    # If the file already exists, resume from its current size
                    if os.path.exists(path):
                        now_length = os.path.getsize(path)
                        print("Resuming after a network hiccup. Downloaded so far: {}MB".format(now_length // 1024 // 1024))
                        headers['Range'] = 'bytes=%d-' % now_length  # local file size is the resume offset, in bytes
                    else:
                        now_length = 0
                    # stream=True defers fetching the body until iter_content() reads it
                    res = requests.get(item['video_url'], stream=True, headers=headers)
                    if res.status_code == 416:
                        break  # requested range starts at the end of the file: already complete
                    res.raise_for_status()
                    if res.status_code == 206:
                        # Partial reply: Content-Length covers only the remaining bytes
                        total_length = now_length + int(res.headers['Content-Length'])
                        mode = 'ab'
                    else:
                        # Server ignored Range and is sending the whole file
                        total_length = int(res.headers['Content-Length'])
                        if now_length >= total_length:
                            break  # the local copy is already complete
                        now_length, mode = 0, 'wb'
                    print("About to download: 【{}】{} {}MB".format(item["video_type"], item["name"], total_length // 1024 // 1024))
                    # Write the received video data
                    with open(path, mode) as file:
                        for chunk in res.iter_content(chunk_size=1024):
                            file.write(chunk)
                            now_length += len(chunk)
                            file.flush()  # write to disk bit by bit
                            # Show the download progress
                            done = int(50 * now_length / total_length)
                            sys.stdout.write(
                                "\r【%s%s】%d%%" % ('█' * done, ' ' * (50 - done), 100 * now_length / total_length))
                            sys.stdout.flush()
                    print()
                    if now_length >= total_length:
                        break
            except Exception as e:
                print(e)
                raise IOError("download failed: {}".format(item['video_url'])) from e
            print("【{}】{} download finished: {}MB".format(item["video_type"], item["name"], now_length // 1024 // 1024))
        return item
# Store in MongoDB
class Video6969Info(object):
    def __init__(self, mongo_host, mongo_db, mongo_coll):
        self.mongo_host = mongo_host
        self.mongo_db = mongo_db
        self.mongo_coll = mongo_coll
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_host=crawler.settings['MONGO_HOST'],
            mongo_db=crawler.settings['MONGO_DB'],
            mongo_coll=crawler.settings['MONGO_COLL']
        )

    def open_spider(self, spider):
        # Connect to the database
        self.client = pymongo.MongoClient(self.mongo_host)
        self.db = self.client[self.mongo_db]  # database handle
        self.coll = self.db[self.mongo_coll]  # collection handle

    def close_spider(self, spider):
        self.client.close()  # close the connection

    def process_item(self, item, spider):
        data = dict(item)  # convert the item to a dict
        try:
            self.coll.insert_one(data)  # insert() is deprecated since pymongo 3.0
            self.count += 1
        except Exception as e:
            raise IOError("MongoDB insert failed") from e
        if not self.count % 100:
            print("Items stored so far: %d" % self.count)
        return item
# Push onto the Redis queue
class CrawlUrls(object):
    def process_item(self, item, spider):
        rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
        if isinstance(item, UrlItem):
            # Same key as redis_key in video_6969.py
            rds.lpush("video6969:start_urls", item['html_url'])
        return item
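Here the requests library fetches the movie's binary data, which is then written to disk. Network fluctuations can easily corrupt a video file, so the download also supports resuming from a breakpoint.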
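The arithmetic behind resuming and the progress bar can be separated from the network code. The helpers below are a sketch with hypothetical names (`range_header`, `total_size`, `progress_bar`); they assume the server answers a Range request with status 206 and a Content-Length that covers only the remaining bytes, which is how partial responses normally behave.

```python
def range_header(offset):
    """Header asking the server for everything from byte `offset` onward."""
    return {"Range": "bytes=%d-" % offset}


def total_size(already_have, status_code, content_length):
    """Full file size implied by a response.

    206 means the server honoured Range, so Content-Length is only the
    remainder; any other success status means the whole file is being sent.
    """
    if status_code == 206:
        return already_have + content_length
    return content_length


def progress_bar(done_bytes, total_bytes, width=50):
    """Render a fixed-width text bar followed by the percentage."""
    filled = int(width * done_bytes / total_bytes)
    percent = int(100 * done_bytes / total_bytes)
    return "【%s%s】%d%%" % ("█" * filled, " " * (width - filled), percent)


print(range_header(1024))
print(total_size(400, 206, 600))
print(progress_bar(500, 1000, width=10))
```

Keeping these pieces as pure functions makes the resume logic easy to check without downloading anything.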
- crawlall.py
from scrapy.commands import ScrapyCommand


class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # Schedule every spider in the project, then start them together
        spider_list = self.crawler_process.spider_loader.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
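- Start
scrapy crawlall
Once started, the crawl_url spider collects URLs and stores them in the Redis queue; the slaves pick up the URLs and begin downloading. Of course, you can also implement the distributed crawl in other ways.
Note: when saving movies, make sure you have read and write permission for the target directory.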
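One way to catch a permission problem before a long download starts is to check the target directory up front. The helper below is a sketch: `ensure_writable_dir` is a hypothetical name, and `os.access` is only a best-effort check (on Windows in particular it may not reflect every access restriction).

```python
import os
import tempfile


def ensure_writable_dir(path):
    """Create the directory if needed and fail early when it is not readable/writable."""
    os.makedirs(path, exist_ok=True)
    if not os.access(path, os.R_OK | os.W_OK):
        raise PermissionError("no read/write access to %s" % path)
    return path


# Demonstrated on a temporary directory rather than the real download folder
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "Video_6969", "demo")
    checked = ensure_writable_dir(target)
    created = os.path.isdir(target)
    print(created)
```

A check like this could be called once when the download pipeline starts, with dir_path as the argument.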