Scrapy in Practice: Crawling Images and Saving Them Locally

Having learned Scrapy, let's put it to use by crawling some images.

First, be clear about the goal: what exactly are we crawling?

[Image 1]

We'll crawl the cover images and listing details for the books on Kongfz (孔夫子旧书网), a used-book marketplace.

[Image 2]

The fields annotated above are what we want to scrape. With the target settled, we can write items.py:

import scrapy


class MyscrapyItem(scrapy.Item):

    # plain data fields
    title = scrapy.Field()
    author = scrapy.Field()
    time = scrapy.Field()
    new_price = scrapy.Field()
    old_price = scrapy.Field()

    # fields needed for downloading and saving the images
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()
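
The names image_urls and images are not arbitrary: by default, Scrapy's ImagesPipeline reads download URLs from a field called image_urls and writes the download results into images. If you ever want different field names, they can be remapped in settings.py; the values below are just the defaults, shown for reference:

IMAGES_URLS_FIELD = 'image_urls'     # field the pipeline reads URLs from
IMAGES_RESULT_FIELD = 'images'       # field the pipeline writes results to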

Next, analyze the page to see where the information we want is located:

[Image 3]

That covers a single page; when the current page has a next page, we need to follow it and scrape that one as well:

[Image 4]

Now write the spider (spider.py):

import scrapy

from ..items import MyscrapyItem


class KongfzSpider(scrapy.Spider):
    name = 'kongfz'
    allowed_domains = ['kongfz.com']
    start_urls = ['http://item.kongfz.com/Cjisuanji/']

    def parse(self, response):
        divs = response.xpath("//div[@id='listBox']/div")
        for div in divs:

            item = MyscrapyItem()
            item['title'] = div.xpath("./div[@class='item-info']//a/text()").get()
            item['author'] = div.xpath("./div[@class='item-info']//span[1]/text()").get()
            item['time'] = div.xpath("./div[@class='item-info']//span[3]/text()").get()
            item['new_price'] = div.xpath("./div[@class='item-other-info']/div[1]//span[@class='price']/text()").get()
            item['old_price'] = div.xpath("./div[@class='item-other-info']/div[2]//span[@class='price']/text()").get()
            item['image_urls'] = [div.xpath(".//div[@class='big-img-box']/img/@src").get()]

            yield item

        # pagination: follow the next page with the same callback
        next_url = response.xpath("//a[@class='next-btn']/@href").get()
        if next_url is not None:
            yield response.follow(next_url, callback=self.parse)
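
Before running the full spider, it is worth validating the XPath expressions interactively in scrapy shell (a quick sanity check, not part of the project code):

scrapy shell http://item.kongfz.com/Cjisuanji/
>>> len(response.xpath("//div[@id='listBox']/div"))
>>> response.xpath("//a[@class='next-btn']/@href").get()

If the last expression returns a URL, the pagination logic above keeps following it; on the final page next-btn is absent, get() returns None, and the crawl stops.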

Next, write middlewares.py. When requesting each page we apply a small anti-anti-crawling trick: attach a Referer header so each request looks like in-site navigation.

    def process_request(self, request, spider):
        # disguise the request as in-site navigation by setting a Referer,
        # to get past the site's anti-crawling checks
        referer = request.url
        if referer:
            request.headers['referer'] = referer
        return None
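
This method belongs inside the MyscrapyDownloaderMiddleware class that scrapy startproject generates (it is enabled in settings.py below). If you only need this one hook, a stripped-down version of that class could look like the sketch here; the remaining template methods can stay as generated:

class MyscrapyDownloaderMiddleware:
    def process_request(self, request, spider):
        # set the Referer, then return None so Scrapy keeps processing
        # the request through the rest of the downloader chain
        request.headers['referer'] = request.url
        return None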

Then write pipelines.py to customize how the images are downloaded and stored:

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request
import hashlib
from scrapy.utils.python import to_bytes


class MyscrapyPipeline:
    # template default; not enabled in settings.py, kept for completeness
    def process_item(self, item, spider):
        return item


class KongfzImgDownloadPipeline(ImagesPipeline):

    # default request headers for the image downloads
    default_headers = {
        'accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'referer': 'http://item.kongfz.com/Cjisuanji/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0',
    }

    # disguise the download as an in-site request (anti-anti-crawling)
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            self.default_headers['referer'] = image_url
            yield Request(image_url, headers=self.default_headers)

    # customize the storage sub-directory and file name
    def file_path(self, request, response=None, info=None, *, item=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
        return f'full/{item["title"]}/{image_guid}.jpg'

    # after the downloads finish, record where the files were stored
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
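
One caveat: file_path drops item["title"] straight into the path, and real titles can contain characters that are illegal in Windows file names (\, /, :, *, ?, and so on), which would make the save fail. A defensive variant, as a sketch (the underscore-replacement rule is my own assumption, adjust to taste):

import re

def _safe_name(name):
    # swap out the characters Windows forbids in file and directory names
    return re.sub(r'[\\/:*?"<>|]', '_', (name or 'untitled').strip())

Inside file_path you would then return f'full/{_safe_name(item["title"])}/{image_guid}.jpg' instead.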

Finally, configure settings.py:

BOT_NAME = 'myscrapy'

SPIDER_MODULES = ['myscrapy.spiders']
NEWSPIDER_MODULE = 'myscrapy.spiders'


FEED_EXPORT_ENCODING = 'utf-8'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3314.0 Safari/537.36 SE 2.X MetaSr 1.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False     # False: do not obey robots.txt for this crawl

# keep Scrapy's cookie middleware enabled (the default), so cookies set by
# responses are sent back on later requests
COOKIES_ENABLED = True

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'myscrapy.middlewares.MyscrapyDownloaderMiddleware': 543,   # fakes the Referer so page requests look like in-site navigation
}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'myscrapy.pipelines.KongfzImgDownloadPipeline': 300,
}

IMAGES_STORE = 'D:\\images'  # root directory where downloaded images are saved
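
If the site starts throttling the crawl, Scrapy's built-in politeness settings can be added here too (optional; these values are illustrative, not part of the original project):

DOWNLOAD_DELAY = 0.5                 # pause between requests to the same site
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the request rate
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap parallel requests per domain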

Every step is in place, so let the spider perform (the -o flag exports the scraped items to kongfz.json):

scrapy crawl kongfz -o kongfz.json
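
Each record in that JSON export carries the plain fields plus the bookkeeping the pipeline fills in, roughly in this shape (values elided; for illustration only):

{
  "title": "...",
  "author": "...",
  "time": "...",
  "new_price": "...",
  "old_price": "...",
  "image_urls": ["..."],
  "images": [{"url": "...", "path": "...", "checksum": "..."}],
  "image_paths": ["..."]
}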

[Image 5]

After going off for a glass of water, let's take a look at kongfz.json:

[Image 6]

And the downloaded images themselves:

[Image 7]

Pretty much perfect: a whole batch of nice images is waiting for you to browse.

To sum up, a Scrapy project typically involves editing five files, in this order:

items.py --> spider.py --> middlewares.py --> pipelines.py --> settings.py

With all of Kongfz's images downloaded, crawling images from other sites should hold no fear.

If you have ideas of your own, it's your turn to perform.
