Python Crawler 08: Downloading Images with Scrapy

Downloading images with Scrapy

The official Scrapy documentation explains how to configure file and image downloads:
https://docs.scrapy.org/en/latest/topics/media-pipeline.html
(You may need a VPN to access it.)

To download the cover images of cnblogs news pages, first create an images folder under the working directory, then modify settings.py.

Add at the top:

import os

Change the pipelines section to:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
   'scrapy.pipelines.images.ImagesPipeline': 1
}

Add at the end:

IMAGES_URLS_FIELD = "front_image_url"
project_dir = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')
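
The images pipeline also supports optional size filtering and thumbnail generation through settings. These setting names come from the official Scrapy docs; the values below are just illustrative:

```python
# Optional: skip images smaller than the given dimensions (pixels)
IMAGES_MIN_HEIGHT = 100
IMAGES_MIN_WIDTH = 100

# Optional: generate thumbnails alongside each full-size image,
# stored under thumbs/<name>/ inside IMAGES_STORE
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
```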

The main-page spider from earlier has a few pitfalls.

First, the extracted cover-image URL needs normalizing, because cnblogs serves protocol-relative URLs:

            if not image_url.startswith('http'):
                image_url = 'https:' + image_url
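
A more robust alternative to prefixing the scheme by hand is URL joining, which handles protocol-relative URLs (`//img.cnblogs.com/...`) and absolute URLs alike. A minimal sketch with the stdlib (the page URL below is just an illustrative assumption; inside a spider, `response.urljoin` does the same thing):

```python
from urllib.parse import urljoin

def normalize_image_url(page_url, image_url):
    # urljoin resolves a protocol-relative URL ('//host/path')
    # against the page's scheme, and leaves absolute URLs untouched
    return urljoin(page_url, image_url)

print(normalize_image_url('https://news.cnblogs.com/n/1/', '//img.cnblogs.com/a.png'))
# https://img.cnblogs.com/a.png
```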

Later, the URL assigned to the article item must be passed in as a list:

            if response.meta.get("front_image_url", ""):
                # using .get avoids an exception when the key is missing
                article_item['front_image_url'] = [response.meta.get("front_image_url", "")]
                '''big pitfall: the cover-image link must be a list, or the pipeline raises an error'''
            else:
                article_item['front_image_url'] = []
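
The wrapping logic above can be isolated in a tiny helper that makes the "must be a list" requirement explicit (the helper name is my own, not part of Scrapy):

```python
def as_url_list(value):
    # The images pipeline expects the URLs field to be a list:
    # map missing/empty values to [], wrap a single URL, pass lists through.
    if not value:
        return []
    if isinstance(value, list):
        return value
    return [value]
```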




We can also hook into the pipeline's download process. Add to pipelines.py:

from scrapy.pipelines.images import ImagesPipeline

class ArticleImagePiplines(ImagesPipeline):
    def item_completed(self, results, item, info):
        if "front_image_url" in item:
            image_file_path = ""  # default in case no download succeeded
            for ok, value in results:
                if ok:
                    image_file_path = value['path']
            item['front_image_path'] = image_file_path
        return item
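
`item_completed` receives `results` as a list of `(success, info)` 2-tuples; on success, `info` is a dict with `path`, `url`, and `checksum` keys. The path extraction can be sketched as a plain function, independent of the Scrapy class:

```python
def first_image_path(results):
    # results: list of (ok, value) pairs from the images pipeline;
    # return the stored path of the first successful download, or ''.
    for ok, value in results:
        if ok:
            return value['path']
    return ''

# on failure, value is an error object rather than a dict, so the
# ok check above is what keeps the lookup safe
```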

settings.py must be updated accordingly:

ITEM_PIPELINES = {
   'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
   'ArticleSpider.pipelines.ArticleImagePiplines': 1
}

Files don't have to be stored locally; Scrapy also supports cloud storage.
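
As an example of the cloud-storage support, pointing IMAGES_STORE at an S3 URI switches storage to the cloud (the bucket name is a placeholder, and S3 support requires botocore, per the Scrapy docs):

```python
# settings.py -- store downloaded images in S3 instead of the local disk
IMAGES_STORE = 's3://my-bucket/images/'

# optional: credentials, if not already provided via the environment
AWS_ACCESS_KEY_ID = '...'
AWS_SECRET_ACCESS_KEY = '...'
```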
