Scrapy: Framework Concepts and Execution Flow

Why learn Scrapy
1. Scrapy covers roughly 90% of crawling needs; only the remaining 10% call for custom solutions
2. It makes the development process convenient and fast
3. The Scrapy framework makes our crawlers more efficient
What is Scrapy
Documentation:
Chinese: https://www.osgeo.cn/scrapy/
English: https://docs.scrapy.org/en/latest/
Scrapy uses the Twisted ['twɪstɪd'] asynchronous networking framework, which speeds up our downloads.
Scrapy is an application framework written for crawling websites and extracting structured data; we only need to implement a small amount of code to start scraping quickly.
The difference between asynchronous and non-blocking

Non-blocking describes a single call: it returns immediately instead of waiting for the result. Asynchronous describes how the result is delivered: the caller moves on and is notified later (for example via a callback) when the result is ready, which is the model Twisted uses.
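A minimal asyncio sketch (standard-library Python, not Scrapy or Twisted code) of the asynchronous model; the two waits overlap instead of running back to back:

import asyncio

async def fetch(name, delay):
    # stand-in for a network wait; control returns to the event loop here
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # both coroutines wait concurrently, so this takes ~1s, not ~2s
    print(await asyncio.gather(fetch("a", 1), fetch("b", 1)))

if __name__ == '__main__':
    asyncio.run(main())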
Using Scrapy

scrapy startproject my_spider
scrapy genspider baidu baidu.com

Run the spider from the project root (the directory containing scrapy.cfg):

scrapy crawl baidu
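genspider writes a spider skeleton into my_spider/spiders/baidu.py, roughly like this (the exact template varies slightly between Scrapy versions):

import scrapy

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = ["https://baidu.com"]

    def parse(self, response):
        pass  # parsing logic goes here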
Case study: scraping Qingting FM

scrapy startproject fm
cd fm  # enter the project directory (the one that contains spiders)
scrapy genspider qingting https://m.qingting.fm/rank/
scrapy crawl qingting  # run the spider
Quick start via cmdline
from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute("scrapy crawl qingting".split())
yield in the parse function

The only objects a parse function may yield are: BaseItem, Request, dict, or None.
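For example, a sketch of a parse method body (assuming it sits inside a Spider class with scrapy imported; the dict key is arbitrary):

def parse(self, response, **kwargs):
    yield {"url": response.url}          # dict: handed to the item pipelines
    yield scrapy.Request(response.url)   # Request: sent back to the scheduler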
Execution flow: the Spider (class) produces Request objects -> Scrapy Engine -> Scheduler (a queue of requests) -> Downloader -> the response returns to the Engine -> is handed to the Spider -> its parse function runs.

Other components:
SpiderMiddlewares: code that can run before requests are filtered and sent
DownloaderMiddlewares: downloader middleware
Item: data validation
Pipeline: data persistence
Crawling with Scrapy, case study: Qingting FM

qingting.py
import scrapy
from scrapy import cmdline
from scrapy.http import HtmlResponse


class QingtingSpider(scrapy.Spider):
    name = "qingting"
    allowed_domains = ["m.qingting.fm", "pic.qtfm.cn"]
    start_urls = ["https://m.qingting.fm/rank/"]

    def parse(self, response, **kwargs):
        a_list = response.xpath('//div[@class="rank-list"]/a')
        for a_temp in a_list:
            rank_number = a_temp.xpath('./div[@class="badge"]/text()').extract_first()
            title = a_temp.xpath('.//div[@class="title"]/text()').extract_first()
            rank_desc = a_temp.xpath('.//div[@class="desc"]/text()').extract_first()
            img_url = a_temp.xpath('./img/@src').extract_first()
            yield {
                'type': 'info',
                'rank_number': rank_number,
                'title': title,
                'img_url': img_url,
                'desc': rank_desc,
            }
            # Build a new Request inside parse to fetch the image itself.
            # When the callback takes custom parameters, pass them through
            # cb_kwargs; the keys must match the parameter names exactly.
            yield scrapy.Request(img_url, callback=self.image_parse,
                                 cb_kwargs={'image_name': title})

    # callback that handles the image response
    @staticmethod
    def image_parse(response: HtmlResponse, image_name):
        yield {
            'type': 'image',
            'image_name': image_name + '.jpg',
            'image_content': response.body,
        }


if __name__ == '__main__':
    cmdline.execute("scrapy crawl qingting".split())
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import os
import pymongo


class FmPipeline:
    def process_item(self, item, spider):
        # 1. Read the 'type' field of the yielded item
        type = item.get('type')
        if type == 'image':
            download_url = os.getcwd() + '/download/'
            if not os.path.exists(download_url):
                os.mkdir(download_url)
            image_name = item.get('image_name')
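As the header comment reminds us, the pipeline only runs if it is registered in settings.py; a sketch (300 is just the conventional priority value, lower numbers run first):

# settings.py (excerpt)
ITEM_PIPELINES = {
    "fm.pipelines.FmPipeline": 300,
}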