Introduction to Scrapy and scrapy-redis
Talk outline
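I. Introduction
1. About Scrapy
Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
- Official documentation: https://docs.scrapy.org/en/latest/index.html
2. About scrapy-redis
scrapy-redis provides Redis-based components for building distributed crawlers with Scrapy.
The first feature listed in the official introduction is distributed crawling: you can start multiple spider instances that share a single Redis queue, which is best suited for broad multi-domain crawls.
- Official documentation: https://scrapy-redis.readthedocs.io/en/stable/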
Environment used in this talk
Python 3.7.7
Scrapy 2.5.1
scrapy-redis 0.7.1
The HTML page to crawl is served locally, so the spider only needs to point at localhost.
Company list
小米科技有限责任公司 @北京
小米有品科技有限公司@江苏
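II. Basic crawl workflow and goal
Prepare pending tasks (input) -> spider consumes them -> clean and store (output)
Goal: crawl the home page and extract the region of each company
Input: URL
Output: JSON
    {
        "company_name": "xxxx",
        "province": "xxx"
    }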
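The cleaning step of this workflow can be sketched in plain Python, outside Scrapy. The helper name `to_record` is hypothetical; the raw "name@province" format is taken from the company list shown earlier, and stripping whitespace on both fields is an assumption about what "cleaning" should do here.

```python
import json


def to_record(raw: str) -> dict:
    """Split one 'company@province' string into the target record."""
    # Split on the first '@' only, then drop stray whitespace around each part.
    company_name, province = raw.split("@", 1)
    return {
        "company_name": company_name.strip(),
        "province": province.strip(),
    }


# One raw string from the company list, converted to the JSON output format.
record = to_record("小米科技有限责任公司 @北京")
record_json = json.dumps(record, ensure_ascii=False)
```

The later sections do the same split inside the spider and move the whitespace cleanup into the item definition.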
III. Scrapy spider demo: output to a file
- Start the Node.js service
    cd ./web_demo
    node index.js
- Create the crawler project
    scrapy -h    # list all commands
    scrapy startproject company
    cd company
- Create the spider
    scrapy genspider companyindex localhost
- Edit the spider file
    # -*- coding: utf-8 -*-
    import scrapy
    from loguru import logger as log


    class CompanyindexSpider(scrapy.Spider):
        name = 'companyindex'
        allowed_domains = ['localhost']
        start_urls = ['http://localhost:3000']

        def parse(self, response):
            log.info(response.body)
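- Run the spider
    scrapy crawl companyindex --nolog
The output shows that the body of the target page has been fetched.
- Parse the page to extract the basic data
You can explore the page interactively with the scrapy shell tool:
    scrapy shell 'http://localhost:3000'
    response.xpath('//div[@class="word"]/text()').getall()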
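To make the XPath expression concrete without a running server, the sketch below applies an equivalent path to a small document. Two things here are assumptions: the markup is hypothetical (the real page served by web_demo is not shown in this talk), and the standard library's ElementTree stands in for Scrapy's selectors, so it supports only a subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the class name "word" comes from the XPath in the talk.
SAMPLE_HTML = """
<html>
  <body>
    <div class="word">小米科技有限责任公司 @北京</div>
    <div class="word">小米有品科技有限公司@江苏</div>
    <div class="other">ignored</div>
  </body>
</html>
"""


def extract_words(markup: str) -> list:
    """Return the text of every <div class="word"> element in the document."""
    root = ET.fromstring(markup.strip())
    # './/div[@class="word"]' matches div elements at any depth with that class value.
    return [div.text for div in root.findall(".//div[@class='word']")]


words = extract_words(SAMPLE_HTML)
```

In the spider itself, `response.xpath(...).getall()` returns the same kind of list of strings.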
Next, extend the companyindex spider:
    # -*- coding: utf-8 -*-
    import scrapy
    from loguru import logger as log


    class CompanyindexSpider(scrapy.Spider):
        name = 'companyindex'
        allowed_domains = ['localhost']
        start_urls = ['http://localhost:3000']

        def parse(self, response):
            # Each matching text node has the form "<company name>@<province>".
            context = response.xpath('//div[@class="word"]/text()').getall()
            for context_item in context:
                company_name, province = context_item.split('@')
                re_data = {
                    "company_name": company_name,
                    "province": province
                }
                log.info(re_data)
                yield re_data
Run it from the command line:
    scrapy crawl companyindex --nolog -o companys.jl    # write the scraped items to a file
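The .jl extension selects Scrapy's JSON Lines exporter, so the file holds one JSON object per line. A minimal sketch of reading such a file back follows; the helper name `read_json_lines` is hypothetical, and the in-memory sample stands in for the real companys.jl, whose contents are not shown in this talk.

```python
import io
import json


def read_json_lines(stream) -> list:
    """Parse a JSON Lines stream: one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


# Stand-in for open("companys.jl", encoding="utf-8"); the values are illustrative.
sample = io.StringIO(
    '{"company_name": "Company A ", "province": "Province X"}\n'
    '{"company_name": "Company B", "province": "Province Y"}\n'
)
records = read_json_lines(sample)
```

Because each line is an independent document, the file can be processed line by line without loading everything at once.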
IV. Scrapy spider demo: defining and post-processing the output
Edit items.py:
    # -*- coding: utf-8 -*-
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://docs.scrapy.org/en/latest/topics/items.html
    import scrapy
    from itemloaders.processors import Join, MapCompose, TakeFirst


    def add_trim(value):
        # Strip leading and trailing whitespace from a scraped string.
        return value.strip()


    class CompanyItem(scrapy.Item):
        # define the fields for your item here like:
        company_name = scrapy.Field(
            input_processor=MapCompose(add_trim),
            output_processor=TakeFirst()
        )
        tag = scrapy.Field(
            output_processor=Join(',')
        )
        province = scrapy.Field(
            output_processor=TakeFirst()
        )
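To show what these processors do to the collected values, here is a rough pure-Python sketch. The functions `map_compose`, `take_first`, and `join` are simplified stand-ins written for this illustration, not the itemloaders implementations; for example, the real MapCompose also flattens iterable results and drops None values.

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose: apply each function to every value."""
    def process(values):
        for function in functions:
            values = [function(value) for value in values]
        return values
    return process


def take_first(values):
    """Simplified stand-in for TakeFirst: first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def join(separator):
    """Simplified stand-in for Join: concatenate the collected values."""
    def process(values):
        return separator.join(values)
    return process


# An item loader collects a list of values per field; processors reduce that list.
company_name = take_first(map_compose(str.strip)(["小米科技有限责任公司 "]))
tag = join(",")(["test", "20211125"])
```

The input processor runs as values are added to the loader, and the output processor runs when the item is built.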
Edit the spider file:
    # -*- coding: utf-8 -*-
    import scrapy
    from loguru import logger as log
    from company.items import CompanyItem
    from scrapy.loader import ItemLoader


    class CompanyindexSpider(scrapy.Spider):
        name = 'companyindex'
        allowed_domains = ['localhost']
        start_urls = ['http://localhost:3000']

        def parse(self, response):
            context = response.xpath('//div[@class="word"]/text()').getall()
            for context_item in context:
                l = ItemLoader(item=CompanyItem(), response=response)
                company_name, province = context_item.split('@')
                l.add_value("company_name", company_name)
                l.add_value("tag", 'test')      # crawl tag: environment
                l.add_value("tag", '20211125')  # crawl tag: date (year, month, day)
                l.add_value("province", province)
                yield l.load_item()
Run the command again:
scrapy crawl companyindex --nolog -o companys.jl
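V. Scrapy spider demo: storing items through a pipeline
All of the runs above wrote their output to a file with the -o option. How can the output go to some other medium, such as MySQL? This is what Scrapy's Item Pipeline is for.
- In settings.py, enable the ITEM_PIPELINES setting
    ITEM_PIPELINES = {
        'company.pipelines.CompanyPipeline': 300,
    }
The number is the priority: pipelines with lower values run first.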
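The ordering rule can be illustrated with a few lines of plain Python. The helper `run_order` is hypothetical and is not how Scrapy is implemented internally; the second entry is the RedisPipeline that appears in the scrapy-redis settings later in this talk.

```python
ITEM_PIPELINES = {
    "scrapy_redis.pipelines.RedisPipeline": 301,
    "company.pipelines.CompanyPipeline": 300,
}


def run_order(pipelines: dict) -> list:
    """Return pipeline paths in the order items pass through them."""
    # Sort by the configured number, ascending: lower values come first.
    return [path for path, _ in sorted(pipelines.items(), key=lambda entry: entry[1])]


order = run_order(ITEM_PIPELINES)
```

Here CompanyPipeline (300) handles each item before RedisPipeline (301), regardless of the order in which the dictionary lists them.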
Suppose the items should be written to a specific file.
Edit pipelines.py as follows:
    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    from itemadapter import ItemAdapter
    import json


    class CompanyPipeline(object):
        def open_spider(self, spider):
            self.file = open('company_items.jl', 'w')

        def close_spider(self, spider):
            self.file.close()

        def process_item(self, item, spider):
            line = json.dumps(ItemAdapter(item).asdict()) + "\n"
            self.file.write(line)
            return item
Run the command:
scrapy crawl companyindex
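The same three hooks can target a database instead of a file. The sketch below uses SQLite from the Python standard library as a stand-in for MySQL, which this section mentions but does not demonstrate. The class name `SqliteCompanyPipeline`, the table layout, and the in-memory default database are all assumptions made for illustration; items are assumed to be dict-like.

```python
import sqlite3


class SqliteCompanyPipeline:
    """Hypothetical pipeline that stores items in SQLite instead of a file."""

    def __init__(self, database=":memory:"):
        self.database = database

    def open_spider(self, spider):
        # Open the connection once per crawl and make sure the table exists.
        self.connection = sqlite3.connect(self.database)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS company (company_name TEXT, province TEXT)"
        )

    def close_spider(self, spider):
        # Persist pending rows, then release the connection.
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # Parameterized query: scraped values are never spliced into the SQL text.
        self.connection.execute(
            "INSERT INTO company (company_name, province) VALUES (?, ?)",
            (item.get("company_name"), item.get("province")),
        )
        return item
```

To try it, the class would be placed in pipelines.py and registered in ITEM_PIPELINES under its import path, with a file path passed as the database so the rows outlive the process.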
VI. Introduction to scrapy-redis
The approach commonly shown online is to override the make_request_from_data method.
File: companyindex.py
    # -*- coding: utf-8 -*-
    from loguru import logger as log
    from company.items import CompanyItem
    from scrapy.loader import ItemLoader
    from scrapy_redis.spiders import RedisSpider
    from scrapy.http import Request
    import json


    class CompanyindexSpider(RedisSpider):
        name = 'companyindex'
        allowed_domains = ['localhost']

        def make_request_from_data(self, data):
            # data is the raw task read from Redis; it is expected to be a JSON object.
            try:
                task_params = json.loads(data)
                log.info(task_params)
                return self.make_requests_from_url(task_params)
            except Exception:
                log.info('parse json error')
                return None

        def make_requests_from_url(self, task_params):
            """Build the Request for one task.

            This reuses the name of Scrapy's deprecated
            Spider.make_requests_from_url, but takes the decoded task dict
            instead of a URL string.
            """
            url = task_params.get("url")
            log.info(f"task url: {url}")
            return Request(url, dont_filter=True, callback=self.company_parse)

        def company_parse(self, response):
            context = response.xpath('//div[@class="word"]/text()').getall()
            for context_item in context:
                l = ItemLoader(item=CompanyItem(), response=response)
                company_name, province = context_item.split('@')
                l.add_value("company_name", company_name)
                l.add_value("tag", 'test')
                l.add_value("tag", '20211125')
                l.add_value("province", province)
                yield l.load_item()
File: settings.py
    BOT_NAME = 'company'
    SPIDER_MODULES = ['company.spiders']
    NEWSPIDER_MODULE = 'company.spiders'
    ROBOTSTXT_OBEY = True
    REDIS_URL = 'redis://127.0.0.1:2888/7'
    ITEM_PIPELINES = {
        'company.pipelines.CompanyPipeline': 300,
        'scrapy_redis.pipelines.RedisPipeline': 301
    }
    # Redis key the spider reads its tasks from
    REDIS_START_URLS_KEY = "scrapy_companyindex_spider"
    # Keep the request queue, the duplicate filter, and the stats in Redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    STATS_CLASS = "scrapy_redis.stats.RedisStatsCollector"
File: pipelines.py
    # -*- coding: utf-8 -*-
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    from itemadapter import ItemAdapter
    import json
    from loguru import logger as log


    class CompanyPipeline(object):
        def open_spider(self, spider):
            log.info(" CompanyPipeline open_spider--------")
            self.file = open('company_items.jl', 'w')

        def close_spider(self, spider):
            log.info(" CompanyPipeline close_spider--------")
            self.file.close()

        def process_item(self, item, spider):
            log.info(" CompanyPipeline process_item--------")
            line = json.dumps(ItemAdapter(item).asdict()) + "\n"
            self.file.write(line)
            # Flush instead of closing: a RedisSpider keeps running while it waits
            # for new tasks, and closing the file here would break the next write.
            self.file.flush()
            return item
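With this setup the spider waits for tasks on the Redis key named by REDIS_START_URLS_KEY. A minimal sketch of the producer side follows, assuming the default list-based queue and a client object that offers an `lpush` method, as redis-py clients do. The helper name `push_task` is hypothetical; the JSON shape `{"url": ...}` matches what make_request_from_data reads.

```python
import json

# Same value as REDIS_START_URLS_KEY in settings.py.
START_URLS_KEY = "scrapy_companyindex_spider"


def push_task(client, url: str, key: str = START_URLS_KEY) -> str:
    """Serialize one task as JSON and push it onto the list the spider reads."""
    payload = json.dumps({"url": url})
    client.lpush(key, payload)
    return payload


# With the redis-py package installed (an assumption), a real client could be used:
#   import redis
#   client = redis.Redis.from_url("redis://127.0.0.1:2888/7")
#   push_task(client, "http://localhost:3000")
```

Several spider processes started with `scrapy crawl companyindex` can then consume tasks from the same key.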
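Other topics
Middleware: spider middleware and downloader middleware
Rate control
How do you crawl an aggregate page, for example a detail page that contains several resource links?
Use Request.meta to pass data from one callback to the next.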
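For the rate-control topic listed above, Scrapy exposes several settings. The fragment below is only a sketch for settings.py: the setting names are Scrapy's, but every value is illustrative and not taken from this talk.

```python
# settings.py (illustrative values)
CONCURRENT_REQUESTS = 8             # global cap on requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # cap per target domain
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site

# AutoThrottle adapts the delay to the latency it observes.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0           # upper bound for the delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests per remote site
```

A fixed DOWNLOAD_DELAY is the simplest control; AutoThrottle is worth considering when the target's response time varies.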