Python stock data scraping (two methods)

Analyzing the stock HTML pages:

Stock listings can be viewed on Eastmoney (东方财富网):

http://quote.eastmoney.com/stocklist.html

Viewing the page source, entries like these appear:

  • R001(201008)
  • R004(201010)
  • RC001(202001)

  • The stock code can be extracted from each href. For details on a
    specific stock, query Baidu Stocks (百度股票) at:
    'https://gupiao.baidu.com/stock/<stock code>.html'
    Viewing the source of an individual stock page:

    国祯环保 (300388)  已休市 (market closed)  2017-09-29 15:00:03

    今开 (open): 19.92
    成交量 (volume): 8917手
    最高 (high): 20.15
    涨停 (limit up): 21.96
    内盘 (sell lots): 4974手
    成交额 (turnover): 1786.10万
    委比 (order ratio): -50.69%
    流通市值 (float market cap): 59.98亿
    市盈率MRQ (P/E, MRQ): 50.59
    每股收益 (EPS): 0.20
    总股本 (total shares): 3.06亿
    昨收 (previous close): 19.96
    换手率 (turnover rate): 0.30%
    最低 (low): 19.92
    跌停 (limit down): 17.96
    外盘 (buy lots): 3943手
    振幅 (amplitude): 1.15%
    量比 (volume ratio): 0.11
    总市值 (market cap): 61.35亿
    市净率 (P/B): 3.91
    每股净资产 (net assets per share): 5.14
    流通股本 (float shares): 2.99亿
    The stock name sits in an <a> tag with class="bets-name"; all the other data points sit in dt/dd tag pairs.
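
    As a quick sanity check of that structure, here is a minimal bs4 sketch run against an inline fragment modeled on the page above (the fragment itself is made up, not the real page source):

    from bs4 import BeautifulSoup

    # Made-up fragment mirroring the structure described above
    html = '''
    <div class="stock-bets">
      <a class="bets-name" href="#">国祯环保 (300388)</a>
      <dl><dt>今开</dt><dd>19.92</dd></dl>
      <dl><dt>成交量</dt><dd>8917手</dd></dl>
    </div>
    '''
    soup = BeautifulSoup(html, 'html.parser')
    info = soup.find('div', attrs={'class': 'stock-bets'})
    print(info.find('a', attrs={'class': 'bets-name'}).text.split()[0])  # 国祯环保
    for dt, dd in zip(info.find_all('dt'), info.find_all('dd')):
        print(dt.string, dd.string)  # 今开 19.92, then 成交量 8917手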


    Method 1: bs4 and regular expressions

    import requests
    from bs4 import BeautifulSoup
    import re
    
    # Passing the encoding explicitly saves requests the time it would
    # otherwise spend guessing it from the response
    def getHTMLText(url, code='UTF-8'):
        try:
            r = requests.get(url)
            r.raise_for_status()
            r.encoding = code
            return r.text
        except:
            return ""
    
    
    def getStockList(url, stockList):
        # The Eastmoney list page is GB2312-encoded
        html = getHTMLText(url, 'GB2312')
        soup = BeautifulSoup(html, 'html.parser')
        aInformation = soup.find_all('a')
        for ainfo in aInformation:
            try:
                # Stock codes look like sh600000 / sz000001 inside the href
                stockList.append(re.findall(r'[s][hz]\d{6}', ainfo.attrs['href'])[0])
            except:
                # Skip <a> tags without an href or without a stock code
                continue
    
    
    def getStockInformation(detailUrl, outputFile, stockList):
        count = 0
        for name in stockList:
            count = count + 1
            stockUrl = detailUrl + name + '.html'
            html = getHTMLText(stockUrl)
            try:
                if html == "":
                    continue
                stockDict = {}
                soup = BeautifulSoup(html, 'html.parser')
                stockinfo = soup.find('div', attrs={'class': 'stock-bets'})
                stockname = stockinfo.find('a', attrs={'class': 'bets-name'})
                # When a tag contains nested tags, .text still returns the full
                # text, whereas .string may return None
                stockDict["股票名称"] = stockname.text.split()[0]
                stockKey = stockinfo.find_all('dt')
                stockValue = stockinfo.find_all('dd')
                for i in range(len(stockKey)):
                    stockDict[stockKey[i].string] = stockValue[i].string
                # '\r' returns to the start of the line and end='' suppresses the
                # newline, so the progress percentage updates in place
                print("\r{:5.2f}%".format((count / len(stockList) * 100)), end='')
                # 'a' opens the file in append mode
                f = open(outputFile, 'a', encoding='utf-8')
                f.write(str(stockDict) + '\n')
                f.close()
            except:
                print("\r{:5.2f}%".format((count / len(stockList) * 100)), end='')
                continue
    
    
    def main():
        listUrl = 'http://quote.eastmoney.com/stocklist.html'
        detailUrl = 'https://gupiao.baidu.com/stock/'
        outputFile = 'C:/Users/Administrator/Desktop/out.txt'
        stockList = []
        getStockList(listUrl, stockList)
        getStockInformation(detailUrl, outputFile, stockList)
    
    
    if __name__ == '__main__':
        main()
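
    Each line of out.txt then holds the str() of one dictionary, so a record looks roughly like this (values from the sample page above):

    {'股票名称': '国祯环保', '今开': '19.92', '成交量': '8917手', ...}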
    
    


    Method 2: the Scrapy framework and regular expressions

    (1) Create the project and the Spider template (saved as stocks.py)

    On the command line, change into: E:\PythonProject\BaiduStocks

    Run: scrapy startproject BaiduStocks   (creates the Scrapy project)

    Run: scrapy genspider stocks baidu.com (creates the spider template; baidu.com is the domain the crawl is restricted to, and this restriction can simply be deleted in stocks.py)
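
    For reference, startproject generates the standard Scrapy layout (roles noted per file):

    BaiduStocks/
        scrapy.cfg              # deployment configuration
        BaiduStocks/
            __init__.py
            items.py
            middlewares.py
            pipelines.py        # edited in step (3)
            settings.py         # edited in step (3)
            spiders/
                stocks.py       # created by genspider, edited in step (2)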


    (2) Write the spider (the stocks.py file)

    CSS selectors return the selected tag elements; calling extract() turns each element into a string, which can then be processed with regular expressions.
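
    Outside a running spider, the same selection can be tried interactively with scrapy.Selector (the HTML string here is a made-up fragment):

    from scrapy import Selector

    sel = Selector(text='<div class="stock-bets"><dt>今开</dt><dd>19.92</dd></div>')
    print(sel.css('.stock-bets').css('dt').extract())  # ['<dt>今开</dt>']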

    Regular expression walkthrough:

    The extracted bets-name element is a multi-line string shaped roughly like:

    <a class="bets-name" href="...">
                国祯环保 (
    <span>300388</span>
    )
    </a>

    re.findall('.*\(', stockname)[0].split()[0] + '(' + re.findall('\>.*\<', stockname)[0][1:-1] + ')'

    Match result: 国祯环保(300388)


    Because '(' is a metacharacter in regular expression syntax, it has to be escaped as \(.

    Since '.' does not match newlines, the pattern is effectively matched line by line. '.*\(' first matches ['            国祯环保 (']; splitting on whitespace gives ['国祯环保', '('], and the first element is the name. '\>.*\<' matches ['>300388<'] on the <span> line, and the slice [1:-1] strips the angle brackets to leave the code.
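
    As a quick check, the expression can be run against a stand-in string shaped like the element above (this stockname is a mock-up, not the real extracted markup):

    import re

    # Mock-up of the string extracted from the bets-name <a> element
    stockname = ('<a class="bets-name" href="#">\n'
                 '            国祯环保 (\n'
                 '<span>300388</span>\n'
                 ')\n'
                 '</a>')
    name = re.findall('.*\(', stockname)[0].split()[0]  # '国祯环保'
    code = re.findall('\>.*\<', stockname)[0][1:-1]     # '300388'
    print(name + '(' + code + ')')                      # 国祯环保(300388)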

    # -*- coding: utf-8 -*-
    import scrapy
    import re
    
    
    class StocksSpider(scrapy.Spider):
        name = 'stocks'
        start_urls = ['http://quote.eastmoney.com/stocklist.html']
    
        def parse(self, response):
            count = 0
            for href in response.css('a').extract():
                try:
                    # Only follow the first 300 stock links
                    if count == 300:
                        break
                    count = count + 1
                    stockname = re.findall(r'[s][hz]\d{6}', href)[0]
                    stockurl = 'https://gupiao.baidu.com/stock/' + stockname + '.html'
                    yield scrapy.Request(url=stockurl, headers={"User-Agent": "Chrome/10"}, callback=self.stock_parse)
                except:
                    # Skip anchors that do not contain a stock code
                    continue
    
        def stock_parse(self, response):
            stockDict = {}
            # Select the element with class="stock-bets"
            stockinfo = response.css('.stock-bets')
            # extract() yields a list of strings; take the first match
            stockname = stockinfo.css('.bets-name').extract()[0]
            keyList = stockinfo.css('dt').extract()
            valueList = stockinfo.css('dd').extract()
            stockDict['股票名称'] = re.findall('.*\(', stockname)[0].split()[0] + '(' + re.findall('\>.*\<', stockname)[0][1:-1] + ')'
            for i in range(len(keyList)):
                # Slice off the leading '>' and the trailing '</dt>' / '</dd>'
                stockkey = re.findall(r'>.*', keyList[i])[0][1:-5]
                stockvalue = re.findall(r'>.*', valueList[i])[0][1:-5]
                stockDict[stockkey] = stockvalue
            yield stockDict
    
    
    
    


    (3) Write the pipeline (the pipelines.py file)

    Scrapy auto-generates the item-processing class BaidustocksPipeline; instead of using it, we add a new BaidustocksinfoPipeline class and write its item-handling methods

    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class BaidustocksPipeline(object):
        def process_item(self, item, spider):
            return item
    
    
    class BaidustocksinfoPipeline(object):
        # Called when the spider is opened
        def open_spider(self, spider):
            self.f = open(r'E:\PythonProject\BaiduStocks\BaiduStocks\asdqwe.txt', 'a')
    
        # Called when the spider is closed
        def close_spider(self, spider):
            self.f.close()
    
        # Called once for each scraped item
        def process_item(self, item, spider):
            try:
                self.f.write(str(item) + '\n')
            except:
                pass
            return item
    


    At this point the settings.py configuration file must be updated:

    ITEM_PIPELINES = {
        'BaiduStocks.pipelines.BaidustocksinfoPipeline': 300,
    }
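
    The value 300 is the pipeline's order: Scrapy runs enabled pipelines in ascending order of this number, which is conventionally chosen from 0-1000.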



    (4) Run the spider: scrapy crawl stocks
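
    Alternatively, Scrapy's built-in feed export can write the yielded dicts to a file directly, without the custom pipeline:

    scrapy crawl stocks -o stocks.json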
