Scrapy Framework Crawler Case Study

    • 1 What is Scrapy
    • 2 Scrapy Architecture
    • 3 Scrapy Architecture Diagram
    • 4 Case Study
      • 4.1 Scraping Alibaba Job Postings from Jobui
      • 4.2 Creating the Scrapy Project
      • 4.3 Defining the Item
      • 4.4 Writing the Spider
        • 4.4.1 Creating alibaba.py
        • 4.4.2 Writing alibaba.py
        • 4.4.3 Modifying settings.py
        • 4.4.4 Modifying pipelines.py
      • 4.5 Running Scrapy
        • 4.5.1 Method One
        • 4.5.2 Method Two

1 What is Scrapy

Scrapy is a fast, high-level screen-scraping and web-crawling framework for Python, used to crawl websites and extract structured data from their pages. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing.
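Scrapy is a third-party package; if it is not already available in your environment, it can usually be installed with pip:

pip install scrapy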

2 Scrapy Architecture

  • Scrapy Engine
    Coordinates communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.

  • Scheduler
    Accepts the Requests sent over by the engine, organizes and enqueues them, and hands them back to the engine when it asks for them.

  • Downloader
    Downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them on to the Spider for processing.

  • Spider
    Processes all Responses, extracting the data needed for the Item fields and submitting any follow-up URLs back to the engine, which sends them to the Scheduler again.

  • Item Pipeline
    Processes the Items produced by the Spider and performs post-processing such as detailed analysis, filtering, and storage.

  • Downloader Middlewares
    Components for customizing and extending the download behavior (a minimal sketch follows this list).

  • Spider Middlewares
    Components for customizing and extending the communication between the engine and the Spider.
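As a concrete illustration of the middleware hook points, here is a minimal downloader-middleware sketch; the class name and header are hypothetical, chosen only for illustration:

# A hypothetical downloader middleware that stamps a custom header
# onto every outgoing request before the downloader sends it
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # Called once for each request passing through this middleware
        request.headers.setdefault('X-Example', 'demo')  # illustrative header
        # Returning None lets the request continue through the chain
        return None

Such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting, shown (commented out) in section 4.4.3.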

3 Scrapy Architecture Diagram

(Figure 1: Scrapy architecture diagram)

4 Case Study

4.1 Scraping Alibaba Job Postings from Jobui

Target: Alibaba's job listings on Jobui, https://www.jobui.com/company/281097/jobs/

4.2 Creating the Scrapy Project

In the PyCharm terminal, create the Scrapy project with scrapy startproject Alibaba (replace Alibaba with a project name of your own choosing).
The command is as follows:

scrapy startproject Alibaba

As shown below:
(Figure 2: terminal output of scrapy startproject)
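The command generates the standard project skeleton; its layout typically looks like this (comments added for orientation):

Alibaba/
    scrapy.cfg            # deployment configuration
    Alibaba/
        __init__.py
        items.py          # item definitions (section 4.3)
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines (section 4.4.4)
        settings.py       # project settings (section 4.4.3)
        spiders/
            __init__.py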

4.3 Defining the Item

Edit items.py inside the Alibaba package. The code is as follows:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

# Define an Item class that inherits from scrapy.Item
class AlibabaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Field for the job title
    job_title = scrapy.Field()

    # Field for the work location
    address = scrapy.Field()

    # Field for the job requirements
    detail = scrapy.Field()

4.4 Writing the Spider

4.4.1 Creating alibaba.py

In the PyCharm terminal, cd into the Alibaba folder and run scrapy genspider example example.com, where example is the name you define for the spider (you can choose your own) and example.com is the domain the spider will crawl.
The command is as follows:

scrapy genspider alibaba jobui.com
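Running the command produces a spider skeleton roughly like the following (the exact template varies slightly between Scrapy versions):

import scrapy

class AlibabaSpider(scrapy.Spider):
    name = 'alibaba'
    allowed_domains = ['jobui.com']
    start_urls = ['http://jobui.com/']

    def parse(self, response):
        pass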

4.4.2 Writing alibaba.py

Replace the generated skeleton with the following code:

import scrapy

# Import the Item class
from ..items import AlibabaItem

# Import bs4 to parse the response
import bs4

# Define a spider class that inherits from scrapy.Spider
class AlibabaSpider(scrapy.Spider):
    # The spider's name; it must be unique within the project
    name = 'alibaba'
    # Only URLs under this domain will be crawled
    allowed_domains = ['www.jobui.com']
    # The URL the spider starts crawling from
    start_urls = ['https://www.jobui.com/company/281097/jobs/p1']
    # Build the remaining paginated URLs (starting at 2, since p1 is already in the list)
    for page in range(2, 601):
        url = 'https://www.jobui.com/company/281097/jobs/p{i}'.format(i=page)
        # Append each page URL to start_urls
        start_urls.append(url)

    # parse is the default callback for handling responses
    def parse(self, response):
        # Parse the response with BeautifulSoup
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # Use find_all to extract the <div> tags that contain all the job listings
        all_knowledge = bs.find_all('div', class_="c-job-list")
        # Loop over every listing in all_knowledge
        for data in all_knowledge:
            # Instantiate AlibabaItem
            item = AlibabaItem()
            # Extract the job title
            item['job_title'] = data.find_all('div', class_="job-segmetation")[0].a.h3.text
            # Extract the work location
            item['address'] = data.find_all('div', class_="job-segmetation")[1].find_all('span')[0].text
            # Extract the job requirements
            item['detail'] = data.find_all('div', class_="job-segmetation")[1].find_all('span')[1].text
            # Yield the item back to the engine
            yield item
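Since Scrapy responses come with built-in selectors, the same extraction can also be written without BeautifulSoup. The following is a sketch, assuming the same page structure and class names as above; the spider name and structure here are illustrative, not part of the original project:

import scrapy

from ..items import AlibabaItem

# A variant of the spider above that uses Scrapy's built-in CSS selectors
# instead of BeautifulSoup
class AlibabaCssSpider(scrapy.Spider):
    name = 'alibaba_css'
    allowed_domains = ['www.jobui.com']
    start_urls = ['https://www.jobui.com/company/281097/jobs/p1']

    def parse(self, response):
        for job in response.css('div.c-job-list'):
            segs = job.css('div.job-segmetation')
            item = AlibabaItem()
            # ::text extracts the text content, like .text in bs4
            item['job_title'] = segs[0].css('a h3::text').get()
            spans = segs[1].css('span::text').getall()
            item['address'] = spans[0] if spans else None
            item['detail'] = spans[1] if len(spans) > 1 else None
            yield item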

4.4.3 Modifying settings.py

The code is as follows:

# Scrapy settings for Alibaba project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Alibaba'

SPIDER_MODULES = ['Alibaba.spiders']
NEWSPIDER_MODULE = 'Alibaba.spiders'

# Output file path for the feed export
FEED_URI = '%(name)s.csv'
# Output format for the feed export
FEED_FORMAT = 'csv'
# Encoding of the exported file
FEED_EXPORT_ENCODING = 'utf-8'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# Set the request User-Agent header
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'

# Obey robots.txt rules
# Disabled so the crawl is not restricted by robots.txt
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Set the download delay to 2 seconds
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Alibaba.middlewares.AlibabaSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Alibaba.middlewares.AlibabaDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Enabled so the pipeline written in 4.4.4 actually runs; left commented
# out, its process_item would never be called
ITEM_PIPELINES = {
    'Alibaba.pipelines.AlibabaPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
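Note that FEED_URI and FEED_FORMAT are deprecated as of Scrapy 2.1; in newer versions the same CSV export is configured through the single FEEDS setting, roughly like this:

# Equivalent feed-export configuration for Scrapy >= 2.1
FEEDS = {
    '%(name)s.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}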

4.4.4 Modifying pipelines.py

The code is as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
# Import openpyxl for writing Excel files
import openpyxl

# Define the AlibabaPipeline class
class AlibabaPipeline(object):
    # __init__ runs when the pipeline is instantiated
    def __init__(self):
        # Create a workbook
        self.wb = openpyxl.Workbook()

        # Get the active worksheet
        self.ws = self.wb.active

        # Use append() to write the header row
        self.ws.append(['Job Title', 'Location', 'Requirements'])

    # The default method for processing each item
    def process_item(self, item, spider):
        # Collect the job title, location, and requirements into one row
        line = [item['job_title'], item['address'], item['detail']]

        # Append the row to the worksheet
        self.ws.append(line)

        # Return the item to the engine; if other pipelines still need this
        # item, the engine will route it to them
        return item

    # Runs once when the spider closes; without saving here, the workbook
    # would never be written to disk (the filename is an arbitrary choice)
    def close_spider(self, spider):
        self.wb.save('alibaba_jobs.xlsx')
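This pipeline only runs because AlibabaPipeline is registered under ITEM_PIPELINES in settings.py (section 4.4.3); the number 300 there sets its priority relative to any other pipelines.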

4.5 Running Scrapy

There are two ways to run Scrapy.

4.5.1 Method One

In the PyCharm terminal, change into the Scrapy project folder (cd <path to the folder>), then run scrapy crawl alibaba, where alibaba is the spider's name:

scrapy crawl alibaba
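Alternatively, the output file can be chosen on the command line instead of in settings.py; Scrapy's -o option writes the scraped items to the given file:

scrapy crawl alibaba -o alibaba_jobs.csv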

4.5.2 Method Two

Create a main.py file at the same level as scrapy.cfg.
The code is as follows:

from scrapy import cmdline
# Call cmdline.execute() to run the crawl command from within a script

# The last argument must match the spider's `name` attribute
cmdline.execute(['scrapy', 'crawl', 'alibaba'])

As shown below:
(Figure 3: output of running main.py)
