Scrapy Crawler Practice Project [002] - Scraping 360 Photography Images

Crawling 360 photography images

Reference: Python 3 Web Crawler Development in Practice (《Python3网络爬虫开发实战》), p. 497, by Cui Qingcai

Goal: use Scrapy to crawl 360 photography images, save the metadata to MongoDB, and download the images locally

Target URL: http://image.so.com/z?ch=photography

Analysis / key points:

  1. Crawling difficulty:
    a. Beginner level; the static page contains no image data - the images are fetched via AJAX and rendered client-side, with results returned as JSON;

  2. Image download handling: use the built-in ImagesPipeline, with a few methods overridden;

  3. MongoDB storage;

Steps:

  1. Create the Scrapy project / images (spider)
Terminal: > scrapy startproject images360
Terminal: > scrapy genspider images image.so.com
  2. Configure the settings.py file
# MongoDB configuration
MONGO_URI = 'localhost'
MONGO_DB = 'images360'

# Default directory for downloaded images (used by ImagesPipeline)
IMAGES_STORE = './images'

# Heh heh heh...
ROBOTSTXT_OBEY = False

# headers
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
}

# Enable the pipelines (ImagePipeline must have the highest priority, i.e. the lowest number)
ITEM_PIPELINES = {
    'images360.pipelines.ImagePipeline': 300,
    'images360.pipelines.MongoPipeline': 301,
}
  3. Write the items.py file
from scrapy import Item, Field

# Capture every field of the image info
class ImageItem(Item):
    cover_height = Field()
    cover_imgurl = Field()
    cover_width = Field()
    dsptime = Field()
    group_title = Field()
    grpseq = Field()
    id = Field()
    imageid = Field()
    index = Field()
    label = Field()
    qhimg_height = Field()
    qhimg_thumb_url = Field()
    qhimg_url = Field()
    qhimg_width = Field()
    tag = Field()
    total_count = Field()
  4. Write the pipelines.py file
    a) ImagePipeline: adapted from the official Scrapy docs:
    Downloading and processing files and images:
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

# Image download pipeline
class ImagePipeline(ImagesPipeline):
    def file_path(self, request, response=None, info=None):
        '''
        Override file_path to derive the image file name from its URL
        '''
        url = request.url
        file_name = url.split('/')[-1]
        return file_name

    def item_completed(self, results, item, info):
        '''
        Drop items whose image failed to download, so they are not saved to the database
        '''
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Image Downloaded Failed')
        return item

    def get_media_requests(self, item, info):
        '''
        Re-request the image URL so the scheduler queues the download
        '''
        yield Request(url=item['qhimg_url'])
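To illustrate the file_path override in isolation: it keeps only the last URL segment, so each image is saved under IMAGES_STORE with its original file name (the URL below is a made-up example, not taken from a real response):

```python
# file_path keeps only the last URL segment as the file name,
# so this image would land at IMAGES_STORE/t01ab4f6e86b7b7d5c4.jpg
url = 'http://p0.qhimgs4.com/t01ab4f6e86b7b7d5c4.jpg'  # hypothetical URL
file_name = url.split('/')[-1]
print(file_name)  # t01ab4f6e86b7b7d5c4.jpg
```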

b) MongoPipeline: adapted from the official Scrapy docs: https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=mongo Code omitted.
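For completeness, here is a minimal sketch of what that MongoPipeline could look like, along the lines of the docs example. The collection name 'images' and the lazy pymongo import are my choices, not from the original:

```python
# Minimal MongoPipeline sketch, based on the Scrapy docs example.
# Assumes pymongo is installed; 'images' is an assumed collection name.

class MongoPipeline:
    collection_name = 'images'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull MONGO_URI / MONGO_DB from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB'),
        )

    def open_spider(self, spider):
        # Imported lazily so the module loads even without pymongo installed
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert the item as a plain dict and pass it along the pipeline
        self.db[self.collection_name].insert_one(dict(item))
        return item
```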

5. Write the spiders > images.py file
Notes:
a) Override start_requests(self);
b) Build the request URLs dynamically; assign the Fields dynamically and yield the corresponding ImageItem
# Dynamically populate and yield an ImageItem for each image
for image in images:
    item = ImageItem()
    for field in item.fields:
        if field in image.keys():
            item[field] = image.get(field)
    yield item
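The filtering behaviour of that loop can be sketched with plain dicts (the sample keys below are assumed; a real AJAX entry carries more fields): item.fields acts as a whitelist, so JSON keys the Item does not declare are silently dropped.

```python
# Stand-in for ImageItem.fields: the declared field names
item_fields = {'id', 'qhimg_url', 'group_title', 'tag'}

# Stand-in for one entry of the AJAX 'list' (keys assumed)
image = {'id': 'a1', 'qhimg_url': 'http://example.com/t01.jpg',
         'group_title': 'demo', 'unexpected_key': 'dropped'}

# Same pattern as the spider: copy only declared fields
item = {field: image[field] for field in item_fields if field in image}
print(sorted(item))  # ['group_title', 'id', 'qhimg_url']
```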

c) Full code:

import json
from scrapy import Spider, Request
from images360.items import ImageItem

class ImagesSpider(Spider):
    name = 'images'
    # allowed_domains = ['image.so.com']
    # start_urls = ['http://image.so.com/z?ch=photography']

    url = 'http://image.so.com/zj?ch=photography&sn={sn}&listtype=new&temp=1'

    # Overridden
    def start_requests(self):
        # Generate requests for the first 1200 images (sn = 30, 60, ..., 1200)
        for sn in range(1, 41):
            yield Request(url=self.url.format(sn=sn * 30), callback=self.parse)

    def parse(self, response):
        results = json.loads(response.text)
        # Bail out if the response carries no 'list' key
        if 'list' not in results:
            return
        images = results.get('list')

        # Dynamically populate and yield an ImageItem for each image
        for image in images:
            item = ImageItem()
            for field in item.fields:
                if field in image.keys():
                    item[field] = image.get(field)
            yield item
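To see what start_requests actually generates, the first few request URLs can be previewed like this (only the sn parameter varies; sn advances in steps of 30, matching the 30-images-per-page AJAX API):

```python
url = 'http://image.so.com/zj?ch=photography&sn={sn}&listtype=new&temp=1'

# First three request URLs, exactly as start_requests would format them
urls = [url.format(sn=sn * 30) for sn in range(1, 4)]
for u in urls:
    print(u)
```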

6. Run results
(Screenshots of the crawl output and the downloaded images omitted.)

Summary

  1. A beginner-level project that builds further familiarity with the Scrapy workflow;
  2. Practice fetching and parsing AJAX-returned results;
  3. A first look at how ImagesPipeline is used, and how to override it as needed.
