Scrapy provides two Item Pipelines dedicated to downloading files and images:
* FilesPipeline
* ImagesPipeline
(Both are covered in the official documentation.)
You can think of them as downloaders: you hand them the files or images to fetch through a special item field, they download everything into the directory you specify, and they record the results in another special item field so the output is easy to inspect.
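With FilesPipeline the contract looks roughly like this (a minimal sketch; ImagesPipeline behaves the same way, with image_urls and images instead):
#before yielding: you fill file_urls with the URLs to download
item['file_urls'] = ['http://example.com/some_file.py']
#after downloading: the pipeline writes one dict per file into files
item['files']  # e.g. [{'url': ..., 'path': ..., 'checksum': ...}]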
matplotlib is a very useful plotting library, and its website hosts many worked examples, listed at http://matplotlib.org/examples/index.html. We will download these example files locally so they are easy to look up and reuse later.
The links to the examples live in li tags with class="toctree-l1" inside a div with class="toctree-wrapper compound"; a LinkExtractor makes it easy to pull these page links out:
from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l1', deny='/index.html$')
#where the class attribute contains a space, replace it with '.' in the CSS selector
#restrict_css narrows extraction to that region; deny skips the index pages themselves
links = le.extract_links(response)
#returns all matching links
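extract_links returns a list of Link objects; each one exposes the absolute URL and the anchor text, which is handy for a quick check:
for link in links:
    print(link.url, link.text)  # Link objects carry .url and .text attributes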
On each example page, the download link sits in an 'a' tag with class="reference external"; extract it with a CSS selector:
href = response.css('a.reference.external::attr(href)').extract_first()
With the extraction strategy settled, build the project step by step.
1. Create the project:
>>>scrapy startproject matpl
>>>cd matpl
>>>scrapy genspider matplot matplotlib.org
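These commands produce the standard Scrapy project skeleton, roughly:
matpl/
    scrapy.cfg
    matpl/
        items.py
        pipelines.py
        settings.py
        spiders/
            matplot.py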
2. Enable the built-in FilesPipeline in settings.py and set the download directory:
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'examples_src'
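FILES_STORE can be an absolute path or, as here, a path relative to the directory you start the crawl from; Scrapy creates it if it does not already exist.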
3. Define the item in items.py. FilesPipeline reads the download URLs from file_urls and writes the results back to files (note the plurals):
import scrapy

class MatpItem(scrapy.Item):
    # file_urls: list of URLs to download
    # files: filled in by the pipeline after downloading
    file_urls = scrapy.Field()
    files = scrapy.Field()
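After a successful download the pipeline populates files automatically, one dict per file. With the default FilesPipeline the storage path is full/<sha1-of-url> plus the extension, which is why step 5 below overrides it. Illustratively (URL and hash made up):
#contents of item['files'] after the crawl
[{'url': 'http://matplotlib.org/examples/animation/animate_decay.py',
  'path': 'full/0a79c461...py',
  'checksum': '...'}]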
4. Write the spider in spiders/matplot.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from ..items import MatpItem

class MatplotSpider(scrapy.Spider):
    name = "matplot"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['http://matplotlib.org/examples/index.html']

    def parse(self, response):
        # collect links to the individual example pages
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l1', deny='/index.html$')
        print(len(le.extract_links(response)))  # debug: number of example pages found
        for link in le.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_link)

    def parse_link(self, response):
        # the example's download link, made absolute with urljoin
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        matpl = MatpItem()
        matpl['file_urls'] = [url]
        return matpl
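Before running the full crawl, you can sanity-check the parse_link selector in scrapy shell against a single example page (the URL here is illustrative):
>>>scrapy shell http://matplotlib.org/examples/animation/animate_decay.html
>>>response.css('a.reference.external::attr(href)').extract_first()
#prints the href of that page's source-code download link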
5. Write a custom FilesPipeline in pipelines.py, overriding file_path to keep the original file name under its example category:
from scrapy.pipelines.files import FilesPipeline
from urllib.parse import urlparse
from os.path import basename, dirname, join

class MyFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # keep '<category>/<filename>' from the URL's path
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))
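To see how file_path maps a URL to a storage path, walk through the pieces for an illustrative download URL:
path = urlparse('http://matplotlib.org/examples/animation/animate_decay.py').path
#path == '/examples/animation/animate_decay.py'
basename(dirname(path))                 # 'animation' (the example's category)
basename(path)                          # 'animate_decay.py'
join('animation', 'animate_decay.py')   # 'animation/animate_decay.py'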
6. Enable the custom FilesPipeline (this modifies the settings from step 2):
ITEM_PIPELINES = {
    # 'scrapy.pipelines.files.FilesPipeline': 1,
    'matpl.pipelines.MyFilePipeline': 1,
    'matpl.pipelines.MatplPipeline': 300,
}
FILES_STORE = 'examples_src'
7. Run the spider:
>>>scrapy crawl matplot
The configured directory will be created under the project folder and the downloaded files saved into it.
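With the custom file_path, the directory ends up organized by example category, with original file names preserved (names illustrative):
examples_src/
    animation/
        animate_decay.py
        ...
    lines_bars_and_markers/
        ...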
Tips:
When writing the code, pay close attention to plural variable names; many of these special names end in an "s" (file_urls, files, FilesPipeline), so be careful.