When crawling with Scrapy, we often find that the list page yields some of the information we want while the detail page yields the rest, and the detail-page URLs only become available after the list page has been requested. This calls for vertical crawling: first request the list page, parse the detail-page URLs out of its HTML, then request those URLs to get the detail-page content. If data must be saved at both levels (a few fields from the list page, a few more from the detail page), multiple items are needed. This post therefore records how to crawl vertically, save to multiple items, and download the content carried by one of those items.
1 Vertical crawling
Vertical crawling is actually fairly simple. The core is yield Request() plus a chain of parse functions: a Request fetches a URL and hands the response to its callback parse_*() function, which extracts new URLs along with the other content. To go one level deeper, yield another Request from inside that callback, and so on down the hierarchy.
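To see the pattern in isolation, here is a minimal sketch; the spider name and selectors are hypothetical, not the ones used for this site:

import scrapy
from scrapy import Request


class VerticalSketchSpider(scrapy.Spider):
    name = 'vertical_sketch'  # hypothetical name
    start_urls = ['https://example.com/list']

    def parse(self, response):
        # Level 1: pull detail-page URLs out of the list page.
        for href in response.css('a.detail::attr(href)').getall():
            yield Request(url=response.urljoin(href), callback=self.parse_detail)

    def parse_detail(self, response):
        # Level 2: parse the detail page; yielding another Request here
        # would descend one level further.
        yield {'title': response.css('h1::text').get()}

The full spider used for this site is below: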
import math
import random
import time

import scrapy
from scrapy import Request

from xiezhen.items import ModelItem, XiezhenItem


class XzSpider(scrapy.Spider):
    name = 'xz'
    allowed_domains = ['tujigu.com']

    def start_requests(self):
        for i in range(1, 2):
            if i == 1:
                url = 'https://www.tujigu.com/riben/'
            else:
                url = 'https://www.tujigu.com/riben/' + str(i) + '.html'
            # Crude politeness delay; DOWNLOAD_DELAY in settings.py is the
            # non-blocking alternative.
            time.sleep(random.randint(1, 3))
            yield Request(url=url, callback=self.parse_list)

    def parse_list(self, response):
        # Each <li> on the list page holds one gallery's metadata.
        for li in response.css('.hezi ul li'):
            item2 = XiezhenItem()
            item2['list_url'] = li.css('a::attr(href)').extract_first()
            item2['title'] = li.css('p.biaoti a::text').extract_first()
            item2['jigou'] = li.css('p a::text').extract()[0]
            item2['model'] = li.css('p a::text').extract()[1]
            item2['biaoqian'] = li.css('p a::text').extract()[2:-1]
            item2['pic_num'] = li.css('span.shuliang::text').extract()[0][:-1]
            time.sleep(random.random())
            yield item2
            # Compute the number of detail pages: the site shows 6 pictures
            # per page for galleries of 100+ pictures, otherwise 5.
            pic_num = int(item2['pic_num'])
            if pic_num >= 100:
                page = math.ceil(pic_num / 6)
            else:
                page = math.ceil(pic_num / 5)
            # Request every detail page of this gallery.
            for i in range(1, page + 1):
                if i == 1:
                    url = item2['list_url']
                else:
                    url = item2['list_url'] + str(i) + '.html'
                yield Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        item = ModelItem()
        item['pic_urls'] = response.css('.content img::attr(src)').extract()
        item['model'] = response.css('.tuji p a::text').extract()[-2]
        item['model_info'] = response.css('.tuji p::text').extract()[-1].replace('\n', '').replace("''", '')
        yield item
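Note that parse_item re-extracts the model name from the detail page. If you would rather carry fields already scraped on the list page down into the detail callback, a Request can pass them along: Scrapy 1.7+ offers cb_kwargs for this (older versions use meta). A minimal sketch with hypothetical selectors:

import scrapy
from scrapy import Request


class CbKwargsSketchSpider(scrapy.Spider):
    name = 'cb_kwargs_sketch'  # hypothetical name
    start_urls = ['https://example.com/list']

    def parse(self, response):
        for li in response.css('ul li'):
            yield Request(url=response.urljoin(li.css('a::attr(href)').get()),
                          callback=self.parse_detail,
                          # Carry a field scraped here into the detail callback.
                          cb_kwargs={'title': li.css('a::text').get()})

    def parse_detail(self, response, title):
        # The list-page title arrives as a keyword argument.
        yield {'title': title, 'pic_urls': response.css('img::attr(src)').getall()}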
2 Multiple items
To handle multiple items, first define several Item classes in items.py, each inheriting from Scrapy's Item. Import both classes into the spider and instantiate each one, so the data can be stored in whichever item it belongs to; finally, yield them.
import scrapy
from scrapy import Item


class ModelItem(Item):
    model = scrapy.Field()
    model_info = scrapy.Field()
    pic_urls = scrapy.Field()


class XiezhenItem(Item):
    jigou = scrapy.Field()
    model = scrapy.Field()
    biaoqian = scrapy.Field()
    title = scrapy.Field()
    list_url = scrapy.Field()
    pic_num = scrapy.Field()
The part above causes no real trouble. The snag appears when a pipeline downloads images or files: the pipeline has to know which item type it is operating on. In the code below, if isinstance(item, TheItemClass) selects the item the pipeline should handle, so it only downloads the contents of that item. Of course, before downloading we also need to set the storage location in settings.py and enable the pipeline via ITEM_PIPELINES; the settings appear at the very end.
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

from xiezhen.items import ModelItem


class ImagePipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Only ModelItem carries picture URLs; other item types pass
        # through without triggering any downloads.
        if isinstance(item, ModelItem):
            for url in item['pic_urls']:
                yield Request(url=url)

    def file_path(self, request, response=None, info=None):
        # Name the file after the last two segments of its URL.
        url = request.url
        return url.split('/')[-2] + url.split('/')[-1]

    def item_completed(self, results, item, info):
        # Only validate downloads for ModelItem; a XiezhenItem produces no
        # results, and dropping it here would silently lose list-page data.
        if isinstance(item, ModelItem):
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem('Image Download Failed')
        return item
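As an aside, newer Scrapy releases (2.4+, as far as I know) also pass the item itself into file_path, which makes it easy to sort each model's pictures into its own folder. A sketch that would replace file_path above, assuming that newer signature:

    def file_path(self, request, response=None, info=None, *, item=None):
        # Only ModelItem requests reach this pipeline, so item['model'] is safe.
        return item['model'] + '/' + request.url.split('/')[-1]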
Finally, the additions to settings.py:

IMAGES_STORE = './images'
ITEM_PIPELINES = {
    'xiezhen.pipelines.ImagePipeline': 1,
}
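The image pipeline above only downloads; a XiezhenItem simply passes through it untouched. If the list-page items should be persisted as well, a second pipeline using the same isinstance dispatch does the job. A minimal sketch (XiezhenJsonPipeline and the output file name are my own, not part of the original project):

import json

from xiezhen.items import XiezhenItem


class XiezhenJsonPipeline:
    def open_spider(self, spider):
        self.file = open('xiezhen.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Same trick as in ImagePipeline: only XiezhenItem is written here.
        if isinstance(item, XiezhenItem):
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

It would then be registered alongside the image pipeline:

ITEM_PIPELINES = {
    'xiezhen.pipelines.ImagePipeline': 1,
    'xiezhen.pipelines.XiezhenJsonPipeline': 2,
}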