Scrapy is a fast, high-level screen-scraping and web-crawling framework for Python, used to crawl websites and extract structured data from their pages. It is widely applicable and can be used for data mining, monitoring, and automated testing.
Scrapy Engine
Responsible for the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler
Accepts the Request objects sent over by the engine, arranges and enqueues them in a defined order, and hands them back to the engine when the engine asks for them.
Downloader
Downloads all the Requests sent by the Scrapy Engine and returns the Responses it obtains to the engine, which passes them on to the Spider for processing.
Spider
Processes all Responses, parses and extracts the data needed to fill the Item fields, and submits any follow-up URLs to the engine, which feeds them back into the Scheduler.
Item Pipeline
Handles the Items produced by the Spider and performs post-processing on them (detailed analysis, filtering, storage, and so on).
Downloader Middlewares
A component for customizing and extending the download behavior (a minimal sketch follows this component list).
Spider Middlewares
A component for customizing and extending the communication between the engine and the Spider.
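To make the middleware idea concrete, here is a minimal downloader-middleware sketch (not part of this project's code). The class name CustomHeaderMiddleware and the header it sets are illustrative assumptions; process_request, however, is Scrapy's standard hook for intercepting outgoing requests.

class CustomHeaderMiddleware:
    # Called for every outgoing request before it reaches the downloader.
    # Returning None lets the request continue through the middleware chain.
    def process_request(self, request, spider):
        # Illustrative: add a default header if the request does not already set one
        request.headers.setdefault('Accept-Language', 'zh-CN,zh;q=0.9')
        return None

Such a class only takes effect after it is registered in the DOWNLOADER_MIDDLEWARES setting (see the commented-out block in settings.py later in this section).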
Target site: the Alibaba job listings page on Jobui (职友集阿里招聘网).
In the PyCharm terminal, run scrapy startproject Alibaba (the project name is up to you) to create your own Scrapy project.
The command is as follows:
scrapy startproject Alibaba
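After the command finishes, Scrapy generates a project skeleton roughly like the following (the file names are Scrapy's defaults):

Alibaba/
    scrapy.cfg            # deployment configuration
    Alibaba/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py   # spider modules live in this package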
Go into the Alibaba folder and edit items.py. The code is as follows:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy

# Define an item class that inherits from scrapy.Item
class AlibabaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Job title field
    job_title = scrapy.Field()
    # Work location field
    address = scrapy.Field()
    # Job requirements field
    detail = scrapy.Field()
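For reference, an Item behaves like a dictionary restricted to the declared fields, which is exactly how the spider below fills it. A small usage sketch with placeholder values:

item = AlibabaItem()
item['job_title'] = 'Data Engineer'        # placeholder value
item['address'] = 'Hangzhou'               # placeholder value
item['detail'] = '3+ years of experience'  # placeholder value
print(dict(item))   # an Item can be converted to a plain dict
# Assigning to an undeclared key, e.g. item['salary'], raises a KeyError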
In the PyCharm terminal, cd into the Alibaba folder and run scrapy genspider example example.com, where example is the name you give this spider (customizable) and example.com is the domain the spider will crawl.
The command is as follows:
scrapy genspider alibaba jobui.com
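This creates Alibaba/spiders/alibaba.py containing a bare spider template. The exact boilerplate varies slightly between Scrapy versions, but it looks roughly like this:

import scrapy

class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['jobui.com']
    start_urls = ['http://jobui.com/']

    def parse(self, response):
        pass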
Then edit the generated spider file (Alibaba/spiders/alibaba.py). The code is as follows:
import scrapy
# Import the item class
from ..items import AlibabaItem
# Import bs4 for parsing the response
import bs4

# Define a spider class that inherits from scrapy.Spider
class AlibabaSpider(scrapy.Spider):
    # The spider's name; it must be unique within the project
    name = 'alibaba'
    # Domains the spider is allowed to crawl
    allowed_domains = ['www.jobui.com']
    # The URL the spider starts crawling from (page 1)
    start_urls = ['https://www.jobui.com/company/281097/jobs/p1']
    # Build the URLs for pages 2 to 600 and append them to start_urls
    for page in range(2, 601):
        url = 'https://www.jobui.com/company/281097/jobs/p{i}'.format(i=page)
        start_urls.append(url)

    # parse is the default callback for handling the responses
    def parse(self, response):
        # Parse the response body with BeautifulSoup
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # Use find_all to extract the blocks that contain the job postings
        all_knowledge = bs.find_all('div', class_="c-job-list")
        # Loop over every job posting block
        for data in all_knowledge:
            # Instantiate AlibabaItem
            item = AlibabaItem()
            # Extract the job title
            item['job_title'] = data.find_all('div', class_="job-segmetation")[0].a.h3.text
            # Extract the work location
            item['address'] = data.find_all('div', class_="job-segmetation")[1].find_all('span')[0].text
            # Extract the job requirements
            item['detail'] = data.find_all('div', class_="job-segmetation")[1].find_all('span')[1].text
            # yield the item back to the engine
            yield item
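BeautifulSoup does the job here, but Scrapy also ships with its own CSS selectors, which would remove the extra bs4 dependency. A sketch of an equivalent parse method, assuming the page uses the same class names as above:

def parse(self, response):
    # Each div.c-job-list block is one job posting
    for job in response.css('div.c-job-list'):
        item = AlibabaItem()
        segments = job.css('div.job-segmetation')
        # The first segment holds the title, the second the location and requirements
        item['job_title'] = segments[0].css('h3::text').get()
        spans = segments[1].css('span::text').getall()
        item['address'] = spans[0]
        item['detail'] = spans[1]
        yield item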
4.4.3 Modifying settings.py
The code is as follows:
# Scrapy settings for Alibaba project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Alibaba'
SPIDER_MODULES = ['Alibaba.spiders']
NEWSPIDER_MODULE = 'Alibaba.spiders'
# Path of the exported file
FEED_URI = '%(name)s.csv'
# Export format
FEED_FORMAT = 'csv'
# Export file encoding
FEED_EXPORT_ENCODING = 'utf-8'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set the request User-Agent header
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
# Obey robots.txt rules
# Do not obey the robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Set the download delay to 2 seconds
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Alibaba.middlewares.AlibabaSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Alibaba.middlewares.AlibabaDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'Alibaba.pipelines.AlibabaPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
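A note on versions: FEED_URI and FEED_FORMAT still work but are deprecated from Scrapy 2.1 onwards in favor of the combined FEEDS setting. On a recent Scrapy release, the equivalent configuration would be roughly:

FEEDS = {
    '%(name)s.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}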
4.4.4 Modifying pipelines.py
The code is as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
# Import openpyxl for writing Excel files
import openpyxl

# Define the AlibabaPipeline class
class AlibabaPipeline(object):
    # Runs once when the pipeline is instantiated
    def __init__(self):
        # Create a workbook
        self.wb = openpyxl.Workbook()
        # Get the active worksheet
        self.ws = self.wb.active
        # Use append() to add the header row
        self.ws.append(['职位', '工作地点', '招聘要求'])

    # Default method for processing each item
    def process_item(self, item, spider):
        # Collect the job title, work location, and job requirements into one row
        line = [item['job_title'], item['address'], item['detail']]
        # Append the row to the worksheet
        self.ws.append(line)
        # Return the item to the engine so any later pipelines can also process it
        return item

    # Runs once when the spider closes; without saving here, nothing would be written to disk
    def close_spider(self, spider):
        self.wb.save('alibaba.xlsx')  # the file name is arbitrary
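One caveat: a pipeline only runs once it is registered in settings.py, and the ITEM_PIPELINES block there is still commented out. To actually produce the Excel file, enable it:

ITEM_PIPELINES = {
    'Alibaba.pipelines.AlibabaPipeline': 300,
}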
4.5 Running Scrapy
There are two ways to run Scrapy.
4.5.1 Method 1
In the PyCharm terminal, cd into the Scrapy project folder (cd + path to the folder), then run scrapy crawl alibaba (alibaba is the spider's name).
scrapy crawl alibaba
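Since the export settings already live in settings.py, no extra options are needed here. Alternatively, the output file can be requested directly on the command line with the -o option:

scrapy crawl alibaba -o alibaba.csv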
4.5.2 Method 2
Create a main.py file at the same level as scrapy.cfg.
The code is as follows:
from scrapy import cmdline
# Call cmdline.execute() to run the terminal command from code
cmdline.execute(['scrapy', 'crawl', 'alibaba'])
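With main.py in place, running it from PyCharm starts the crawl just as if scrapy crawl alibaba had been typed in the terminal, which also makes it convenient to debug the spider with PyCharm breakpoints.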