Besides scraping text, we often also need to download files, videos, images, archives, and so on. Scrapy provides FilesPipeline and ImagesPipeline for downloading ordinary files and images respectively. Both are simple to use; let's start with FilesPipeline.
FilesPipeline
The FilesPipeline workflow is as follows:

- In the spider, scrape the URLs of the files to download and put them in the item's `file_urls` field;
- The spider returns the item, which is passed along the pipeline chain;
- When FilesPipeline processes the item, it checks for a `file_urls` field; if present, it hands the URLs to the Scrapy scheduler and downloader;
- Once the downloads complete, the results are written to another item field, `files`, which contains each file's local path (relative to the configured `FILES_STORE`), its checksum, and its URL.
From the workflow above, using FilesPipeline has a few requirements:

- The Item must contain the two fields `file_urls` and `files`;
- FilesPipeline must be enabled (in `ITEM_PIPELINES`);
- The download directory `FILES_STORE` must be configured.
The following example downloads the Python example code from https://twistedmatrix.com/documents/current/core/examples/:
```python
# items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class ExamplesItem(scrapy.Item):
    file_urls = scrapy.Field()  # URLs of the files to download
    files = scrapy.Field()      # filled in with download details once the files are fetched
```
```python
# example.py
# -*- coding: utf-8 -*-
import scrapy

from ..items import ExamplesItem


class ExamplesSpider(scrapy.Spider):
    name = 'examples'
    allowed_domains = ['twistedmatrix.com']
    start_urls = ['https://twistedmatrix.com/documents/current/core/examples/']

    def parse(self, response):
        urls = response.css('a.reference.download.internal::attr(href)').extract()
        for url in urls:
            yield ExamplesItem(file_urls=[response.urljoin(url)])
```
```python
# settings.py
# ...
# Enable the built-in FilesPipeline so the downloads actually happen
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = '/root/TwistedExamples/file_store'
# ...
```
Run `scrapy crawl examples`, and the downloaded files will appear under the `FILES_STORE/full` directory, named by the SHA1 hash of their URLs. We will see later how to customize the names; first, let's look at ImagesPipeline.
ImagesPipeline
ImagesPipeline works almost exactly like FilesPipeline; only the field names and setting names differ, as shown below:
| | FilesPipeline | ImagesPipeline |
|---|---|---|
| Pipeline class | scrapy.pipelines.files.FilesPipeline | scrapy.pipelines.images.ImagesPipeline |
| Item fields | file_urls, files | image_urls, images |
| Storage path setting | FILES_STORE | IMAGES_STORE |
In addition, ImagesPipeline offers some extra features:

- Thumbnail generation, via the setting `IMAGES_THUMBS = {'size_name': (width, height),}`;
- Filtering out images that are too small, via the settings `IMAGES_MIN_HEIGHT` and `IMAGES_MIN_WIDTH` (see the settings sketch below).
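As a concrete illustration of these two features, here is a minimal settings.py sketch; the size names and pixel values are made up for the example, not taken from the article:

```python
# settings.py -- illustrative values only
IMAGES_THUMBS = {
    'small': (50, 50),     # thumbnails saved under thumbs/small/<sha1>.jpg
    'big': (270, 270),     # thumbnails saved under thumbs/big/<sha1>.jpg
}
IMAGES_MIN_HEIGHT = 110    # skip images shorter than 110 px
IMAGES_MIN_WIDTH = 110     # skip images narrower than 110 px
```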
Next, let's scrape the photos at http://image.so.com/z?ch=beauty to see how ImagesPipeline works in practice.

By inspecting the requests made by that page, we find that image data is fetched from the API http://image.so.com/zj?ch=beauty&sn=0&listtype=new&temp=1, where `sn=0` is the offset into the image list; each request returns 30 image entries by default. The response is a JSON string like this:
"end": false,
"count": 30,
"lastid": 30,
"list": [{
"id": "b0cd2c3beced890b801b845a7d2de081",
"imageid": "f90d2737a6d14cbcb2f1f2d5192356dc",
"group_title": "清纯美女户外迷人写真笑颜迷人",
"tag": "萌女",
"grpseq": 1,
"cover_imgurl": "http:\/\/i1.umei.cc\/uploads\/tu\/201608\/80\/0dexb2tjurx.jpg",
"cover_height": 960,
"cover_width": 640,
"total_count": 8,
"index": 1,
"qhimg_url": "http:\/\/p0.so.qhmsg.com\/t017d478b5ab2f639ff.jpg",
"qhimg_thumb_url": "http:\/\/p0.so.qhmsg.com\/sdr\/238__\/t017d478b5ab2f639ff.jpg",
"qhimg_width": 238,
"qhimg_height": 357,
"dsptime": ""
},
......省略
, {
"id": "37f6474ea039f34b5936eb70d77c057c",
"imageid": "3125c84c138f1d31096f620c29b94512",
"group_title": "美女萝莉铁路制服写真清纯动人",
"tag": "萌女",
"grpseq": 1,
"cover_imgurl": "http:\/\/i1.umei.cc\/uploads\/tu\/201701\/798\/kuojthsyf1j.jpg",
"cover_height": 587,
"cover_width": 880,
"total_count": 8,
"index": 30,
"qhimg_url": "http:\/\/p2.so.qhimgs1.com\/t0108dc82794264fe32.jpg",
"qhimg_thumb_url": "http:\/\/p2.so.qhimgs1.com\/sdr\/238__\/t0108dc82794264fe32.jpg",
"qhimg_width": 238,
"qhimg_height": 159,
"dsptime": ""
}]
}
We can take each image link from the `qhimg_url` field of the response. The code is as follows:
```python
# items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class BeautyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
```
```python
# beauty.py
# -*- coding: utf-8 -*-
import scrapy
import json

from ..items import BeautyItem


class BeautypicSpider(scrapy.Spider):
    name = 'beautypic'
    allowed_domains = ['image.so.com']
    url_pattern = 'http://image.so.com/zj?ch=beauty&sn={offset}&listtype=new&temp=1'
    # start_urls = ['http://image.so.com/']

    def start_requests(self):
        step = 30
        for page in range(0, 3):
            url = self.url_pattern.format(offset=page * step)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        ret = json.loads(response.body)
        for row in ret['list']:
            yield BeautyItem(image_urls=[row['qhimg_url']], name=row['group_title'])
```
```python
# settings.py

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 5,
}
IMAGES_STORE = '/root/beauty/store_file'
```
After running the spider, the downloaded image files can be found under the `IMAGES_STORE/full` directory.
Customizing the default file name
As seen with both FilesPipeline and ImagesPipeline, the downloaded file names look rather cryptic: they are the SHA1 hash of the file's URL, which mainly keeps files with the same name from overwriting each other. Sometimes, though, we want files named the way we expect. For downloaded files, a look at the FilesPipeline source shows that the file name is determined mainly by `FilesPipeline.file_path()`; part of the code is shown below:
```python
class FilesPipeline(MediaPipeline):
    ...

    def file_path(self, request, response=None, info=None):
        ## start of deprecation warning block (can be removed in the future)
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('FilesPipeline.file_key(url) method is deprecated, please use '
                          'file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from file_key with url as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if file_key() method has been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)
        ## end of deprecation warning block

        media_guid = hashlib.sha1(to_bytes(url)).hexdigest()  # change to request.url after deprecation
        media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
        return 'full/%s%s' % (media_guid, media_ext)
    ...
```
We can therefore subclass FilesPipeline and override its `file_path()` method to redefine the file name. The custom `SelfDefineFilePipline` looks like this:
```python
# pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline


class MatplotlibExamplesPipeline(object):
    def process_item(self, item, spider):
        return item


class SelfDefineFilePipline(FilesPipeline):
    """Subclass FilesPipeline and change how downloaded files are named."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def file_path(self, request, response=None, info=None):
        # Use the last segment of the URL path as the file name.
        parse_result = urlparse(request.url)
        path = parse_result.path
        basename = os.path.basename(path)
        return basename
```
Enable SelfDefineFilePipline in settings.py (a sketch is shown below) and run the spider again; the downloaded files now keep the original names taken from their URLs.
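For completeness, enabling the custom pipeline looks roughly like this; the module path assumes the Scrapy project is named TwistedExamples (matching the FILES_STORE path used earlier), which is not stated explicitly in the article:

```python
# settings.py -- a sketch; 'TwistedExamples' is the assumed project name
ITEM_PIPELINES = {
    # use the subclass from pipelines.py instead of the built-in FilesPipeline
    'TwistedExamples.pipelines.SelfDefineFilePipline': 1,
}
FILES_STORE = '/root/TwistedExamples/file_store'
```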
This is only one way to do it, meant mainly to illustrate the idea; there are many ways to change the file name, depending on the scenario. For example, in the image-download example above, the URL does not contain the image name, so overriding `file_path()` alone cannot produce a meaningful name, because `item['name']` is not available there. Looking at the source, we find that `get_media_requests()` builds the Request used to download each image and has access to the item, so we can pass `item['name']` through the Request's `meta` parameter and read it back inside `file_path()`. Reading the source is also a good way to learn a framework.
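A minimal sketch of that meta-based renaming is shown below; the class name NamedImagesPipeline and the naming scheme (group title plus the URL's file extension) are illustrative, not taken from the original article:

```python
# pipelines.py -- a sketch of passing item['name'] to file_path() via Request.meta
import os
from urllib.parse import urlparse

from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline


class NamedImagesPipeline(ImagesPipeline):
    """Name downloaded images after item['name'] instead of the URL's SHA1 hash."""

    def get_media_requests(self, item, info):
        # One download request per image URL, carrying the item's name in meta.
        for url in item['image_urls']:
            yield Request(url, meta={'name': item['name']})

    def file_path(self, request, response=None, info=None):
        # Read the name back from meta and keep the original file extension.
        name = request.meta['name']
        ext = os.path.splitext(urlparse(request.url).path)[1]
        return 'full/%s%s' % (name, ext)
```

Note that if several items share the same name their files will overwrite each other, which is exactly the collision that the default SHA1 naming is designed to avoid; in this spider each item carries a single image URL, so it is not an issue here.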
Summary
This article showed how to use Scrapy's built-in FilesPipeline and ImagesPipeline to download files and images, and then how to customize file naming by subclassing those pipelines and overriding their methods. The next article will look at LinkExtractor for quickly extracting links and Exporter for exporting results to files.