Python crawler: scraping NASDAQ stock data with the Scrapy framework

Approach:
1. The NASDAQ site: https://www.nasdaq.com;
2. Taking Amazon's AMZN common stock as an example, the historical-data detail page is https://www.nasdaq.com/market-activity/stocks/amzn/historical;
3. The site loads its data dynamically, so Selenium is used to fetch 5 years of AMZN price history; pagination is handled by clicking "next" through the webdriver and saving each page;
4. Parse the data with XPath and save it to a CSV file.

1. Preparation

Create a Scrapy project:

 scrapy startproject STOCK

Create the spider file:

 scrapy genspider nasdaq nasdaq.com
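
The two commands above generate the standard Scrapy skeleton; the files edited in the next part live in roughly this layout:

STOCK/
├── scrapy.cfg
└── STOCK/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── nasdaq.py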

2. Building the framework
(1) items.py / define the item

import scrapy

class StockItem(scrapy.Item):
    DATE = scrapy.Field()
    CLOSE = scrapy.Field()
    VOLUME = scrapy.Field()
    OPEN = scrapy.Field()
    HIGH = scrapy.Field()
    LOW = scrapy.Field()

(2) middlewares.py
Override the downloader middleware so requests go out through the webdriver.

import scrapy
import time
import json
from selenium import webdriver
from selenium.webdriver.common.by import By

class ChromeMiddleware(object):

    def process_request(self, request, spider):
        # 1. Get the URL of the request
        url = request.url
        # 2. Set up the driver object
        driver = webdriver.Chrome()
        # 3. Send the GET request through the driver
        driver.get(url)
        # The data is paginated dynamically, so collect every page's HTML in a list
        data_list = []
        # 5 years of data spans 70 pages; loop over them
        for i in range(70):
            if i == 0:
                # The first page defaults to one month of data; click the "5Y"
                # button to switch to the five-year view
                driver.find_element(By.XPATH, './/div[@class="table-tabs__list"]/button[5]').click()
                # The site loads slowly from overseas; block manually so the
                # response has time to arrive
                time.sleep(10)
                # Save the first page's source into the list
                data = driver.page_source
                data_list.append(data)
            else:
                # From the second page on, click the "next" button and save
                # each page's source into the list
                driver.find_element(By.XPATH, './/button[@class="pagination__next"]').click()
                time.sleep(5)
                data = driver.page_source
                data_list.append(data)
        # 4. Close the browser once everything is saved
        driver.quit()
        # 5. Return the response object Scrapy expects.
        # Looking at the source, HtmlResponse (scrapy.http) inherits from
        # TextResponse; the default encoding is ASCII, so set it to "utf-8".
        # The parent Response takes: (url, status=200, headers=None, body=b'', flags=None, request=None)
        # Key step --> serialize the saved list to a string and encode it to bytes
        return scrapy.http.HtmlResponse(url=url,
                                        status=200,
                                        body=json.dumps(data_list).encode('utf-8'),
                                        encoding='utf-8')
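
The fixed time.sleep calls above work, but they waste time when pages load quickly and still break when pages load slower than expected. A minimal sketch of a sturdier variant, assuming Selenium 4 and headless Chrome (the 15-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def make_driver():
    # Run Chrome without a visible window; faster and usable on a server
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    return webdriver.Chrome(options=options)

def click_when_ready(driver, xpath, timeout=15):
    # Block only until the element is clickable, instead of a fixed sleep
    button = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.XPATH, xpath))
    )
    button.click()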

(3) spiders/nasdaq.py

import scrapy
import json
from lxml import etree
from STOCK.items import StockItem

class NasdaqSpider(scrapy.Spider):
    name = 'nasdaq'
    allowed_domains = ['nasdaq.com']
    start_urls = ['https://www.nasdaq.com/market-activity/stocks/amzn/historical']

    def parse(self, response):
        # Create the item
        item = StockItem()
        # Convert the JSON string back into a list of page sources for iteration
        data_list = json.loads(response.body)
        for i in data_list:
            # Parse each page's HTML so it can be queried with XPath
            data = etree.HTML(i)
            # Get the table rows of the page, then pull out each row's fields
            stock_list = data.xpath('.//tbody[@class="historical-data__table-body"]/tr')
            for j in stock_list:
                item['DATE'] = j.xpath('./th[1]/text()')[0]
                # Strip the leading '$' from the prices here to simplify later analysis
                item['CLOSE'] = j.xpath('./td[1]/text()')[0][1:]
                item['VOLUME'] = j.xpath('./td[2]/text()')[0]
                item['OPEN'] = j.xpath('./td[3]/text()')[0][1:]
                item['HIGH'] = j.xpath('./td[4]/text()')[0][1:]
                item['LOW'] = j.xpath('./td[5]/text()')[0][1:]
                #print(item['DATE'], item['CLOSE'])
                yield item
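
The spider yields every field as a string, and CLOSE/VOLUME may still contain thousands separators (e.g. '1,785.00' or '2,412,500'). If numeric types are wanted for later analysis, a small hypothetical helper like to_number could be applied before yielding:

def to_number(text):
    # Drop '$' and thousands separators, then convert; assumes US-formatted numbers
    cleaned = text.strip().replace('$', '').replace(',', '')
    return float(cleaned) if '.' in cleaned else int(cleaned)

# e.g. item['CLOSE'] = to_number(j.xpath('./td[1]/text()')[0])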

(4) pipelines.py

from scrapy.exporters import CsvItemExporter

# Save the scraped items in CSV format
class StockPipeline(object):
    def open_spider(self, spider):
        self.file = open('AMZN_STOCK_5Y.csv', 'wb')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()
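
One caveat: when fields_to_export isn't set, CsvItemExporter picks the column order itself from the item's fields, which isn't guaranteed across Scrapy versions. Passing fields_to_export pins both the header line and the column order; a sketch of the adjusted open_spider:

    def open_spider(self, spider):
        self.file = open('AMZN_STOCK_5Y.csv', 'wb')
        # fields_to_export fixes the CSV header line and the column order
        self.exporter = CsvItemExporter(
            self.file,
            fields_to_export=['DATE', 'CLOSE', 'VOLUME', 'OPEN', 'HIGH', 'LOW'],
        )
        self.exporter.start_exporting()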

(5) settings.py
It's good practice to turn on each setting as soon as the corresponding code is written, so nothing gets forgotten.

BOT_NAME = 'STOCK'

SPIDER_MODULES = ['STOCK.spiders']
NEWSPIDER_MODULE = 'STOCK.spiders'

LOG_FILE = 'stock.log'
LOG_LEVEL = 'WARNING'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DOWNLOADER_MIDDLEWARES = {
   'STOCK.middlewares.ChromeMiddleware': 543,
}
ITEM_PIPELINES = {
   'STOCK.pipelines.StockPipeline': 300,
}

3. Run the spider

scrapy crawl nasdaq
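
If you'd rather launch the crawl from a plain Python script (handy in an IDE) than from the shell, Scrapy's CrawlerProcess drives the same spider; a minimal sketch, assuming a file named run.py placed next to scrapy.cfg:

# run.py -- lives next to scrapy.cfg so the project settings are found
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl('nasdaq')  # the spider name defined in NasdaqSpider.name
process.start()          # blocks until the crawl finishes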

4. Once the steps above are done, just wait and check the data
