Building a Web Crawler Project with Python + PyCharm + Scrapy


 

Introduction to Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and historical data archiving.

The workflow of the Scrapy framework is shown in the diagram below:

[Figure 1: Scrapy framework workflow diagram]

Scrapy Engine: coordinates the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.

Scheduler: receives Requests sent over from the engine, organizes and enqueues them, and hands them back to the engine when the engine asks for them.

Downloader: downloads all Requests sent by the engine and returns the resulting Responses to the engine, which passes them on to the Spider for processing.

Spider: processes all Responses, extracting from them the data needed for the Item fields, and submits any follow-up URLs to the engine, which feeds them back into the Scheduler.

Item Pipeline: the place where Items produced by the Spider are post-processed (detailed analysis, filtering, storage, and so on).

Downloader Middlewares: components for customizing and extending download behavior, e.g., configuring a proxy server.

Spider Middlewares: components for customizing and extending the communication between the engine and the Spider (e.g., Responses entering the Spider and Requests leaving it).

 

I. Preparation

1. Install Python 3.x

2. Download PyCharm Community Edition

3. Install Scrapy: once Python is installed, run the following command in cmd: pip install scrapy (a quick way to verify the install is shown below)
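
To confirm the installation succeeded, either of the following standard commands should print the Scrapy version:

scrapy version
python -c "import scrapy; print(scrapy.__version__)"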

 

II. Setup Steps

1. Create a crawler project with the scrapy startproject command.

Running the bare command scrapy in cmd prints the usage information shown below:

[Figure 2: output of the scrapy command, listing the available subcommands]

Running scrapy startproject [project-name] creates a Scrapy project in the current directory:

[Figure 3: output of scrapy startproject]
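
For the project used in this article, the concrete command is:

scrapy startproject WebScraping

(Scrapy's output also suggests generating a skeleton spider with the scrapy genspider subcommand; in this article the spider module is written by hand instead.)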

 

The directory structure of the newly created Scrapy project is shown below (using the following project as an example):

[Figure 4: WebScraping project directory structure]
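
For reference, a freshly generated project follows the standard Scrapy template layout:

WebScraping/
    scrapy.cfg            # project configuration file
    WebScraping/          # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # package holding the spider modules
            __init__.py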

① The WebScraping project root contains a package with the same name (WebScraping) and a scrapy.cfg configuration file. The contents of scrapy.cfg are as follows:

    It specifies settings.py inside the WebScraping package as the project's settings file.

[Figure 5: scrapy.cfg contents]
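
For reference, the scrapy.cfg produced by the standard template looks like this (comments trimmed):

[settings]
default = WebScraping.settings

[deploy]
#url = http://localhost:6800/
project = WebScraping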

② The WebScraping package in turn contains the items, middlewares, pipelines, and settings modules, as well as the spiders package.

    (1) The items module defines the item classes. Each item class must inherit from scrapy.Item, and its fields are declared as class attributes via scrapy.Field().

import scrapy

class StockQuotationItem(scrapy.Item):
    '''One row of the stock quotation table on q.10jqka.com.cn.'''
    order = scrapy.Field()                # row number in the table
    symbol = scrapy.Field()               # stock code
    instrument_name = scrapy.Field()      # stock name
    price = scrapy.Field()                # latest price
    pchg = scrapy.Field()                 # percent change
    chg = scrapy.Field()                  # price change
    speed_up = scrapy.Field()             # rate of price change
    turnover = scrapy.Field()             # turnover rate
    QR = scrapy.Field()                   # quantity ratio
    swing = scrapy.Field()                # amplitude
    vol = scrapy.Field()                  # trading volume
    floating_shares = scrapy.Field()      # floating shares
    floating__net_value = scrapy.Field()  # floating market value
    PE = scrapy.Field()                   # price-to-earnings ratio
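
Item objects behave like dictionaries, so a quick usage sketch (with an illustrative value) looks like this:

item = StockQuotationItem()
item['symbol'] = '000001'   # assign a declared field
print(item['symbol'])       # dict-style read access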

 

    (2) The middlewares module defines the project's middleware classes, including its spider middleware and downloader middleware.
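
For illustration only (this class is not part of the generated project), a minimal downloader middleware that sends every request through a proxy might look like the following sketch; the proxy address is a placeholder:

class ProxyDownloaderMiddleware(object):
    '''Sketch: route every outgoing request through an HTTP proxy.'''

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta['proxy']
        request.meta['proxy'] = 'http://127.0.0.1:8888'  # placeholder address
        return None  # None means: continue normal downloading

To take effect, such a class would be registered under DOWNLOADER_MIDDLEWARES in settings.py, e.g. {'WebScraping.middlewares.ProxyDownloaderMiddleware': 543}.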

    (3) The pipelines module processes the items returned by the spider (for example, saving them to a file or a database):

        A pipeline class must implement the process_item() method.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class WebScrapingPipeline(object):
    def __init__(self):
        # open the output file once, when the pipeline is instantiated
        self.f = open('spiderResult.json', 'wb')

    def process_item(self, item, spider):
        result = json.dumps(dict(item), ensure_ascii=False) + ', \n'
        self.f.write(result.encode('utf-8'))

        # return the item so the engine knows it has been processed
        # and can hand over the next one
        return item

    def close_spider(self, spider):
        self.f.close()
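
As an aside, for plain JSON output Scrapy's built-in feed export can do the same job without a custom pipeline:

scrapy crawl WebScraping -o spiderResult.json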

    (4) The settings module contains the project's configuration, such as SPIDER_MODULES and ITEM_PIPELINES. A pipeline class defined in the pipelines module must be registered in settings before Scrapy will use it, in the following format (the number is a priority from 0 to 1000; lower-valued pipelines run first):

ITEM_PIPELINES = {
   'WebScraping.pipelines.WebScrapingPipeline': 300,
}

    The following shows the contents of a settings module; by default it records the location of the project's spider modules (SPIDER_MODULES) and where newly generated spiders are placed (NEWSPIDER_MODULE):

# -*- coding: utf-8 -*-

# Scrapy settings for WebScraping project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'WebScraping'

SPIDER_MODULES = ['WebScraping.spiders']
NEWSPIDER_MODULE = 'WebScraping.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'WebScraping (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'WebScraping.middlewares.WebScrapingSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'WebScraping.middlewares.WebScrapingDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'WebScraping.pipelines.WebScrapingPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
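
One practical note (an observation, not part of the generated file): ROBOTSTXT_OBEY is False and no USER_AGENT is set above, so requests go out with Scrapy's default user agent, which some sites reject. If the target site blocks the crawler, a browser-like user agent can be configured in settings.py, for example:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'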

    (5) The spiders package contains the spider modules, each of which defines spider classes:

        A spider class inherits from scrapy.Spider (or scrapy.CrawlSpider). Its parse method processes each Response, extracting the data needed for the Item fields and submitting follow-up URLs to the engine.

        A spider is expected to define its name, start_urls, allowed_domains, and so on. The spider name is mandatory: it identifies the spider when you run it later.

import scrapy
from ..items import StockQuotationItem

class WebScrapingSpider(scrapy.Spider):
    name = 'WebScraping'                      # used as "scrapy crawl WebScraping"
    allowed_domains = ['q.10jqka.com.cn']
    start_urls = ['http://q.10jqka.com.cn/']

    def parse(self, response):
        # locate the quotation table and iterate over its rows
        quotation_tb = response.xpath('//*[@id="maincont"]/table/tbody')
        quotation_ls = quotation_tb.xpath('tr')
        for quotation in quotation_ls:
            result = quotation.xpath('td/text()').extract()
            # create a fresh item per row (reusing a single item created
            # outside the loop is a common source of duplicated data)
            item = StockQuotationItem()
            # item['order'] = result[0]
            item['symbol'] = result[0]
            item['instrument_name'] = result[1]
            item['price'] = result[2]
            item['pchg'] = result[3]
            item['chg'] = result[4]
            item['speed_up'] = result[5]
            item['turnover'] = result[6]
            item['QR'] = result[7]
            item['swing'] = result[8]
            item['vol'] = result[9]
            item['floating_shares'] = result[10]
            item['floating__net_value'] = result[11]
            # item['PE'] = result[12]
            yield item

 

2. Run the crawler project: scrapy crawl [spider-name] runs the specified spider. It can also be launched from inside PyCharm, as sketched below.
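
If you would rather run or debug the spider inside PyCharm than in cmd, one common approach (a sketch; the file name main.py is arbitrary) is to put a small launcher script next to scrapy.cfg and run that file as an ordinary Python script:

# main.py -- place in the project root, next to scrapy.cfg
from scrapy import cmdline

# equivalent to typing "scrapy crawl WebScraping" in cmd
cmdline.execute('scrapy crawl WebScraping'.split())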

  Below is the output of running the "WebScraping" spider with the scrapy crawl command:

F:\pyworkspace\WebScraping>scrapy crawl WebScraping
2019-02-15 14:59:03 [scrapy.utils.log] INFO: Scrapy 1.5.2 started (bot: WebScraping)
2019-02-15 14:59:03 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-02-15 14:59:03 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'WebScraping', 'CONCURRENT_REQUESTS': 32, 'NEWSPIDER_MODULE': 'WebScraping.spiders', 'SPIDER_MODULES': ['WebScraping.spiders']}
2019-02-15 14:59:03 [scrapy.extensions.telnet] INFO: Telnet Password: 7c6e57a3e25c172e
2019-02-15 14:59:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-02-15 14:59:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-15 14:59:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-15 14:59:07 [scrapy.middleware] INFO: Enabled item pipelines:
['WebScraping.pipelines.WebScrapingPipeline']
2019-02-15 14:59:07 [scrapy.core.engine] INFO: Spider opened
2019-02-15 14:59:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-15 14:59:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-15 14:59:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://q.10jqka.com.cn/> (referer: None)
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '--',
 'chg': '--',
 'floating__net_value': '14.53',
 'floating_shares': '4.32亿亿',
 'instrument_name': '2.88',
 'pchg': '0.88',
 'price': '44.00',
 'speed_up': '0.07',
 'swing': '32.38万万',
 'symbol': '1',
 'turnover': '--',
 'vol': '1.50亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '10.81',
 'chg': '--',
 'floating__net_value': '--',
 'floating_shares': '30.04亿亿',
 'instrument_name': '2.04',
 'pchg': '0.19',
 'price': '10.27',
 'speed_up': '1.74',
 'swing': '5202.49万万',
 'symbol': '2',
 'turnover': '3.55',
 'vol': '14.73亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '13.07',
 'chg': '--',
 'floating__net_value': '--',
 'floating_shares': '38.23亿亿',
 'instrument_name': '1.94',
 'pchg': '0.18',
 'price': '10.23',
 'speed_up': '13.44',
 'swing': '4.99亿亿',
 'symbol': '3',
 'turnover': '3.07',
 'vol': '19.71亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '11.08',
 'chg': '--',
 'floating__net_value': '64.88',
 'floating_shares': '68.52亿亿',
 'instrument_name': '3.58',
 'pchg': '0.33',
 'price': '10.15',
 'speed_up': '8.04',
 'swing': '5.35亿亿',
 'symbol': '4',
 'turnover': '3.55',
 'vol': '19.14亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '--',
 'chg': '--',
 'floating__net_value': '--',
 'floating_shares': '48.70亿亿',
 'instrument_name': '4.57',
 'pchg': '0.42',
 'price': '10.12',
 'speed_up': '4.06',
 'swing': '1.98亿亿',
 'symbol': '5',
 'turnover': '1.10',
 'vol': '10.66亿 亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '5.99',
 'chg': '--',
 'floating__net_value': '--',
 'floating_shares': '20.99亿亿',
 'instrument_name': '2.94',
 'pchg': '0.27',
 'price': '10.11',
 'speed_up': '4.81',
 'swing': '1.00亿亿',
 'symbol': '6',
 'turnover': '1.30',
 'vol': '7.14亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '11.91',
 'chg': '--',
 'floating__net_value': '25.64',
 'floating_shares': '65.81亿亿',
 'instrument_name': '3.05',
 'pchg': '0.28',
 'price': '10.11',
 'speed_up': '5.21',
 'swing': '3.30亿亿',
 'symbol': '7',
 'turnover': '1.56',
 'vol': '21.58亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '--',
 'chg': '--',
 'floating__net_value': '54.70',
 'floating_shares': '15.15亿亿',
 'instrument_name': '5.78',
 'pchg': '0.53',
 'price': '10.10',
 'speed_up': '2.76',
 'swing': '4178.04万万',
 'symbol': '8',
 'turnover': '1.01',
 'vol': '2.62亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '7.12',
 'chg': '--',
 'floating__net_value': '--',
 'floating_shares': '16.32亿亿',
 'instrument_name': '3.71',
 'pchg': '0.34',
 'price': '10.09',
 'speed_up': '14.20',
 'swing': '2.29亿亿',
 'symbol': '9',
 'turnover': '6.75',
 'vol': '4.40亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '10.91',
 'chg': '--',
 'floating__net_value': '18.56',
 'floating_shares': '64.16亿亿',
 'instrument_name': '6.66',
 'pchg': '0.61',
 'price': '10.08',
 'speed_up': '2.89',
 'swing': '1.82亿亿',
 'symbol': '10',
 'turnover': '2.26',
 'vol': '9.63亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '5.34',
 'chg': '--',
 'floating__net_value': '43.85',
 'floating_shares': '42.55亿亿',
 'instrument_name': '5.57',
 'pchg': '0.51',
 'price': '10.08',
 'speed_up': '4.60',
 'swing': '1.92亿亿',
 'symbol': '11',
 'turnover': '1.83',
 'vol': '7.64亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '11.79',
 'chg': '--',
 'floating__net_value': '--',
 'floating_shares': '37.79亿亿',
 'instrument_name': '5.79',
 'pchg': '0.53',
 'price': '10.08',
 'speed_up': '3.10',
 'swing': '1.14亿亿',
 'symbol': '12',
 'turnover': '1.17',
 'vol': '6.53亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '10.35',
 'chg': '--',
 'floating__net_value': '127.71',
 'floating_shares': '61.37亿亿',
 'instrument_name': '3.83',
 'pchg': '0.35',
 'price': '10.06',
 'speed_up': '3.21',
 'swing': '1.93亿亿',
 'symbol': '13',
 'turnover': '1.50',
 'vol': '16.02亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '10.05',
 'chg': '--',
 'floating__net_value': '18.46',
 'floating_shares': '24.50亿亿',
 'instrument_name': '8.65',
 'pchg': '0.79',
 'price': '10.05',
 'speed_up': '6.49',
 'swing': '1.55亿亿',
 'symbol': '14',
 'turnover': '2.31',
 'vol': '2.83亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '10.83',
 'chg': '--',
 'floating__net_value': '34.66',
 'floating_shares': '53.08亿亿',
 'instrument_name': '7.01',
 'pchg': '0.64',
 'price': '10.05',
 'speed_up': '3.86',
 'swing': '1.99亿亿',
 'symbol': '15',
 'turnover': '1.95',
 'vol': '7.57亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '11.47',
 'chg': '--',
 'floating__net_value': '--',
 'floating_shares': '24.67亿亿',
 'instrument_name': '3.07',
 'pchg': '0.28',
 'price': '10.04',
 'speed_up': '9.86',
 'swing': '2.36亿亿',
 'symbol': '16',
 'turnover': '2.67',
 'vol': '8.04亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '11.26',
 'chg': '--',
 'floating__net_value': '21.31',
 'floating_shares': '30.28亿亿',
 'instrument_name': '9.87',
 'pchg': '0.90',
 'price': '10.03',
 'speed_up': '9.02',
 'swing': '2.66亿亿',
 'symbol': '17',
 'turnover': '5.65',
 'vol': '3.07亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '10.03',
 'chg': '--',
 'floating__net_value': '18.77',
 'floating_shares': '8.81亿亿',
 'instrument_name': '19.86',
 'pchg': '1.81',
 'price': '10.03',
 'speed_up': '20.85',
 'swing': '1.79亿亿',
 'symbol': '18',
 'turnover': '3.13',
 'vol': '4435.20万万'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '13.16',
 'chg': '--',
 'floating__net_value': '313.37',
 'floating_shares': '25.81亿亿',
 'instrument_name': '16.13',
 'pchg': '1.47',
 'price': '10.03',
 'speed_up': '10.18',
 'swing': '2.52亿亿',
 'symbol': '19',
 'turnover': '3.47',
 'vol': '1.60亿亿'}
2019-02-15 14:59:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://q.10jqka.com.cn/>
{'QR': '1.85',
 'chg': '--',
 'floating__net_value': '39.20',
 'floating_shares': '41.54亿亿',
 'instrument_name': '4.17',
 'pchg': '0.38',
 'price': '10.03',
 'speed_up': '3.62',
 'swing': '1.50亿亿',
 'symbol': '20',
 'turnover': '4.03',
 'vol': '9.96亿亿'}
2019-02-15 14:59:07 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-15 14:59:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 214,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 6457,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 2, 15, 6, 59, 7, 665701),
 'item_scraped_count': 20,
 'log_count/DEBUG': 22,
 'log_count/INFO': 8,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 2, 15, 6, 59, 7, 122171)}
2019-02-15 14:59:07 [scrapy.core.engine] INFO: Spider closed (finished)

With that, a simple crawler project is complete; the scraped items are also written to spiderResult.json by the pipeline configured above.

 

 
