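I've been looking into backtesting on stock data lately (what I really want to do is quantitative trading), but the APIs that serve data directly are not very stable: tushare times out, and the Yahoo one only works after applying a fix and is still unreliable.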
# Patch package that repairs the Yahoo stock data API
from pandas_datareader import data as pdr
import fix_yahoo_finance
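In the end I decided to learn Python web scraping myself. I had heard about Python crawlers long ago, and after trying it out I think it works fine.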
import requests
from bs4 import BeautifulSoup
import re
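# Step 1: get the list of stocks from Eastmoney (东方财富网).
# Step 2: take each stock code in turn, append it to the Baidu Stocks (百度股票) URL, and visit each resulting link to fetch that stock's information.
# Step 3: save the results to a file.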
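def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise an exception on a non-2xx status code
        r.encoding = code  # set the encoding used to decode the response body
        return r.text
    except requests.RequestException:
        # on any network or HTTP error, return an empty string
        return ""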
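def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")  # fetch only the page's HTML text
    soup = BeautifulSoup(html, 'html.parser')  # parse the HTML; from here on the page source is a tidy tree
    a = soup.find_all('a')  # find every <a> tag on the page
    for i in a:
        # e.g. a[1] is the "要闻" (headlines) link
        # type(a[1]) == bs4.element.Tag
        try:
            # read the href attribute of the <a> tag and take the stock code out of the link
            href = i.attrs['href']
            # e.g. a[1].attrs['href'] == 'http://finance.eastmoney.com/yaowen.html'
            # Shenzhen Stock Exchange codes start with "sz", Shanghai Stock Exchange codes start with "sh",
            # and the numeric part has 6 digits, so the regular expression can be written as [s][hz]\d{6}
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):
            # skip <a> tags that have no href, or whose href contains no stock code
            continue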
def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)  # process one stock at a time
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            # find() returns the whole block inside <div class="stock-bets">
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            # find_all() picks the name element (class "bets-name") out of stockInfo;
            # its text looks like: 基金通乾 (500038)
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            # .text returns the text outside the tag markup; '股票名称' means "stock name"
            infoDict.update({'股票名称': name.text.split()[0]})
            # The other fields are stored in <dt> and <dd> tags: <dt> holds the key and <dd> the value.
            # Collect all keys and values:
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text  # e.g. <dt>最高</dt> ("high") gives "最高"
                val = valueList[i].text  # e.g. <dd>0.94</dd> gives "0.94"
                infoDict[key] = val  # store the value under its key
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            # skip stocks whose page lacks the expected elements
            continue
        finally:
            # runs for every stock, including skipped ones, so the progress can reach 100%
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end="")
def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
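There is one huge problem here: written this way, the script can only scrape a single day's data.
Still, since this is the first crawler I wrote for practice, I recorded the intermediate result of every step as a comment, as a kind of study notes.
Next is the code that can fetch historical data.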
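The code-extraction step in getStockList can be tried offline. The sketch below applies the same pattern, [s][hz]\d{6}, to a few sample href strings and keeps each code only once. The helper name extract_codes and the quote.eastmoney.com links are assumptions made up for illustration, not links copied from the live page.

```python
import re

# Same pattern as in getStockList: "sh" or "sz" followed by six digits.
STOCK_CODE = re.compile(r"[s][hz]\d{6}")


def extract_codes(hrefs):
    """Return the stock codes found in hrefs, in order, without duplicates."""
    codes = []
    for href in hrefs:
        match = STOCK_CODE.search(href)
        if match and match.group(0) not in codes:
            codes.append(match.group(0))
    return codes


# Hypothetical hrefs, for illustration only.
sample_hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://finance.eastmoney.com/yaowen.html",  # no stock code in this link
    "http://quote.eastmoney.com/sh600000.html",  # duplicate of the first link
]

print(extract_codes(sample_hrefs))  # ['sh600000', 'sz000001']
```

Using search() and checking for a match avoids relying on an exception to skip links that carry no code.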
import time
import requests
from lxml import etree
import re
import pandas as pd
class StockCode(object):
    def __init__(self):
        self.start_url = "http://quote.eastmoney.com/stocklist.html#sh"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
        }

    def parse_url(self):
        # send the request and get the response
        response = requests.get(self.start_url, headers=self.headers)
        if response.status_code == 200:
            return etree.HTML(response.content)

    def get_code_list(self, response):
        # build the list of stock codes
        node_list = response.xpath('//*[@id="quotesearch"]/ul[1]/li')
        code_list = []
        for node in node_list:
            try:
                # the code is the run of digits inside the parentheses
                code = re.match(r'.*?\((\d+)\)', etree.tostring(node).decode()).group(1)
                print(code)
                code_list.append(code)
            except AttributeError:
                # re.match returned None: this <li> has no "(digits)" code
                continue
        return code_list

    def run(self):
        html = self.parse_url()
        return self.get_code_list(html)
# Download the historical trading records
class Download_HistoryStock(object):
    def __init__(self, code):
        self.code = code
        self.start_url = "http://quotes.money.163.com/trade/lsjysj_" + self.code + ".html"
        print(self.start_url)
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
        }

    def parse_url(self):
        response = requests.get(self.start_url, headers=self.headers)
        print(response.status_code)
        if response.status_code == 200:
            return etree.HTML(response.content)
        return False

    def get_date(self, response):
        # get the start and end dates, dropping the hyphens (2017-01-02 becomes 20170102)
        start_date = ''.join(response.xpath('//input[@name="date_start_type"]/@value')[0].split('-'))
        end_date = ''.join(response.xpath('//input[@name="date_end_type"]/@value')[0].split('-'))
        return start_date, end_date

    def download(self, start_date, end_date):
        # the code is always sent with a leading "0"; this script only downloads codes from 600000 up
        download_url = "http://quotes.money.163.com/service/chddata.html?code=0"+self.code+"&start="+start_date+"&end="+end_date+"&fields=TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP"
        data = requests.get(download_url, headers=self.headers)
        with open('E:/data/historyStock/' + self.code + '.csv', 'wb') as f:
            # write the response to disk chunk by chunk
            for chunk in data.iter_content(chunk_size=10000):
                if chunk:
                    f.write(chunk)
        print('Stock ---', self.code, 'historical data downloaded')

    def run(self):
        try:
            html = self.parse_url()
            start_date, end_date = self.get_date(html)
            self.download(start_date, end_date)
        except Exception as e:
            # also reached when parse_url returned False (non-200 response)
            print(e)
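if __name__ == '__main__':
    code = StockCode()
    code_list = code.run()
    # NOTE: dcodes (the codes not downloaded yet) is built by the extra steps at the end of this post;
    # run those lines first, or loop over code_list here instead.
    for temp_code in dcodes:
        time.sleep(1)  # pause one second between stocks
        download = Download_HistoryStock(temp_code)
        download.run()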
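The download URL in download() is assembled by string concatenation, and get_date strips the hyphens from the dates. As a minimal sketch, the same URL can be built by two small helpers that are easy to check without touching the network. The names compact_date and build_download_url, the prefix parameter, and the sample dates are assumptions introduced here; the original code always sends a leading 0 before the stock code, so 0 is kept as the default.

```python
BASE_URL = "http://quotes.money.163.com/service/chddata.html"
FIELDS = "TCLOSE;HIGH;LOW;TOPEN;LCLOSE;CHG;PCHG;TURNOVER;VOTURNOVER;VATURNOVER;TCAP;MCAP"


def compact_date(date_text):
    """Turn '2017-01-02' into '20170102', as get_date does."""
    return "".join(date_text.split("-"))


def build_download_url(code, start_date, end_date, prefix="0"):
    """Build the CSV download URL; prefix is the digit placed before the code."""
    return (BASE_URL + "?code=" + prefix + code
            + "&start=" + start_date + "&end=" + end_date
            + "&fields=" + FIELDS)


# Sample code and dates, for illustration only.
url = build_download_url("600000", compact_date("2017-01-02"), compact_date("2017-12-29"))
print(url)
```

Keeping the prefix as a parameter makes the hard-coded 0 visible, so it can be changed in one place if other codes turn out to need a different value.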
What follows are some extra steps, kept here as a record.
# keep only the codes numbered 600000 and above
code_df = pd.Series(code_list).astype('int')
code_list = code_df[code_df >= 600000].astype('str').tolist()
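# Resume after an interruption: list the file names already in the download directory
# and take the set difference with code_list
import os
download_dir = 'E:/data/historyStock/'
codes = []
for filename in os.listdir(download_dir):
    code = filename[0:6]  # the first six characters of the file name are the stock code
    codes.append(code)
dcodes = list(set(code_list).difference(set(codes)))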
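The resume step above can be wrapped in a function and tried in a throwaway directory. This is a sketch under assumptions: pending_codes is a hypothetical helper name, the codes are made up, and the result is sorted so the order is repeatable (list(set(...)) gives no fixed order).

```python
import os
import tempfile


def pending_codes(code_list, download_dir):
    """Return the codes in code_list that have no <code>.csv file in download_dir yet."""
    done = {name[:6] for name in os.listdir(download_dir) if name.endswith(".csv")}
    return sorted(set(code_list) - done)


# Demonstration in a temporary directory with made-up codes.
with tempfile.TemporaryDirectory() as tmp:
    for finished in ("600000", "600004"):
        # create empty files that stand in for finished downloads
        open(os.path.join(tmp, finished + ".csv"), "w").close()
    remaining = pending_codes(["600000", "600004", "600006", "600007"], tmp)

print(remaining)  # ['600006', '600007']
```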
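# Read the downloaded files from disk (they are written to MySQL further down)
dfs = []
for code in codes:
    # '日期' is the date column in the downloaded CSV files
    everydf = pd.read_csv('E:/data/historyStock/%s.csv' % code,
                          encoding='gbk').sort_values(by='日期')
    dfs.append(everydf)
stock = pd.concat(dfs)
# write with the same encoding that the read on the next line expects, and leave out the index column
stock.to_csv('E:/data/Stock.csv', encoding='gbk', index=False)
stock = pd.read_csv('E:/data/Stock.csv', encoding='gbk')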
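Stock.csv is read back with encoding='gbk', so it needs to have been written as GBK as well; to_csv writes UTF-8 unless an encoding is passed. The standard-library sketch below shows what a mismatch does to the '日期' column name. It only illustrates the encoding issue and does not involve pandas.

```python
header = "日期"  # the date column name used in the CSV files

# Written and read with the same encoding: the text survives.
same = header.encode("gbk").decode("gbk")

# Written as UTF-8 but read as GBK: the bytes are misinterpreted.
mixed = header.encode("utf-8").decode("gbk", errors="replace")

print(same == header)   # True
print(mixed == header)  # False
```

With mismatched encodings the column would no longer be named '日期', so any later lookup by that name would fail.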
import MySQLdb as mdb
from sqlalchemy import create_engine
# URL format: user:password@localhost/database_name, here sec_user:password@localhost/securities_master
engine = create_engine('mysql://sec_user:password@localhost/securities_master?charset=utf8')
# store the data in the database
stock.to_sql('historystock', engine)