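Develop a web crawler that collects one stock's daily opening, closing, highest, and lowest prices from a financial site such as Eastmoney (东方财富), Sina Finance (新浪财经), or Nasdaq, store the data in a database, and build a GUI application to visualize it.
(1) Learn how to develop a web crawler;
(2) Learn to build a GUI for a database in Python;
(3) Learn to draw a stock's K-line (candlestick) chart with Matplotlib;
(1) Basic knowledge of web crawlers;
(2) Extracting information from web pages with regular expressions;
(3) Accessing a database and manipulating the data in its tables;
(4) Using the Matplotlib library.
(1) Using a web crawler framework;
(2) Using regular expressions;
(3) Storing data in a database.
(1) Extracting data with regular expressions according to how the page organizes its information;
(2) Presenting the K-line chart.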
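import requests


def getHtml(stack_code):
    # Fetch the static history page for the given stock code.
    # The browser-style User-Agent header makes the request look like a browser.
    data = requests.get(
        "http://quotes.money.163.com/trade/lsjysj_" + stack_code + ".html#06f01",
        headers={"user-agent": "Mozilla/5.0"},
        timeout=10,
    )
    return data.text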
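The site being crawled is the stock trading history section of NetEase Finance (网易财经). Inspecting the URLs shows that the history pages of different stocks differ only in the stock code embedded in the middle of the address, so the page for a given stock is fetched by supplying its code.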
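The URL pattern described above can be sketched as a small helper. The function name `build_history_url` and the six-digit check are assumptions for illustration, not part of the original program; the `#06f01` fragment is left out because a URL fragment is not sent to the server.

```python
def build_history_url(stock_code):
    """Return the NetEase Finance history page URL for a stock code."""
    code = str(stock_code).strip()
    # Assumption: the stock codes of interest are six decimal digits.
    if not (code.isdigit() and len(code) == 6):
        raise ValueError("expected a six-digit stock code, got %r" % (stock_code,))
    return "http://quotes.money.163.com/trade/lsjysj_" + code + ".html"


print(build_history_url("300153"))
```

Validating the code before building the URL keeps a mistyped entry in the GUI input box from turning into a request for a page that does not exist.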
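To simplify the data processing that follows, the function returns the page's text directly, which saves one processing step. (For a response that is valid JSON, json.loads could convert the string straight into a dictionary, but it places format requirements on the data; json.dumps goes the other way, from a dictionary to a string.)
The json module could do that conversion directly, but whatever can reasonably be implemented by hand is written by hand here.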
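The two directions of the json module are easy to mix up, so here is a minimal sketch of both. The sample record is taken from the page data shown later in this report; the variable names are illustrative only.

```python
import json

record = {"日期": "2021-05-14", "开盘价": "6.94"}

# dumps: Python object -> JSON text. ensure_ascii=False keeps the Chinese
# keys readable instead of escaping them as \uXXXX sequences.
text = json.dumps(record, ensure_ascii=False)

# loads: JSON text -> Python object. It raises json.JSONDecodeError when
# the input is not valid JSON, which is the "format requirement".
restored = json.loads(text)

print(text)
print(restored == record)
```

An HTML page is not valid JSON, which is why the crawler parses the page with an HTML parser instead.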
import json

import bs4
from bs4 import BeautifulSoup


def getData(data):
    month_data = []
    # Parse the page text into a searchable tree.
    soup = BeautifulSoup(data, "html.parser")
    # The data sits in the table with class "table_bg001 border_box limit_sale"
    # inside the div with class "inner_box".
    table = soup.find("div", class_="inner_box").find(
        "table", class_="table_bg001 border_box limit_sale")
    # The first tr holds only th header cells, so skip it.
    dataList = table.find_all("tr")[1:]
    with open("data.txt", "a+", encoding="utf-8") as f:
        for item in dataList:
            kv = {}
            if isinstance(item, bs4.element.Tag):
                tdList = item.find_all("td")
                kv["日期"] = tdList[0].text  # text between <td> and </td>
                kv["开盘价"] = tdList[1].text
                kv["最高价"] = tdList[2].text
                kv["最低价"] = tdList[3].text
                kv["收盘价"] = tdList[4].text
                kv["成交量"] = tdList[7].text
                month_data.append(kv)
                # One JSON record per line; keep the Chinese keys readable.
                f.write(json.dumps(kv, ensure_ascii=False) + "\n")
    return month_data
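First, BeautifulSoup parses the page into a soup object. The request itself is sent with a browser-style User-Agent header so that the site's anti-crawler checks are less likely to reject it.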
<div class="inner_box">
<div class="search_area align_r">
<form id="date" action="/trade/lsjysj_300153.html">
<select name="year">
<option value="2021" selected>2021</option><option value="2020" >2020</option><option value="2019" >2019</option><option value="2018" >2018</option><option value="2017" >2017</option><option value="2016" >2016</option><option value="2015" >2015</option><option value="2014" >2014</option><option value="2013" >2013</option><option value="2012" >2012</option><option value="2011" >2011</option><option value="2010" >2010</option> </select>
<select name="season">
<option value="4" >四季度</option><option value="3" >三季度</option><option value="2" selected>二季度</option><option value="1" >一季度</option> </select>
<input type="submit" value="查询" class="search_btn"/>
<a href="" class="download_link" id="downloadData">下载数据</a>
</form>
</div>
<table class="table_bg001 border_box limit_sale">
<thead>
<tr class="dbrow">
<th>日期</th>
<th>开盘价</th>
<th>最高价</th>
<th>最低价</th>
<th>收盘价</th>
<th>涨跌额</th>
<th>涨跌幅(%)</th>
<th>成交量(手)</th>
<th>成交金额(万元)</th>
<th>振幅(%)</th>
<th>换手率(%)</th>
</tr>
</thead>
<tr class=''>
<td>2021-05-14</td>
<td class='cRed'>6.94</td>
<td class='cRed'>8.10</td>
<td class='cRed'>6.78</td>
<td class='cRed'>8.10</td>
<td class='cRed'>1.35</td>
<td class='cRed'>20.00</td>
<td>489,575</td>
<td>36,666</td>
<td>19.56</td>
<td>15.30</td>
</tr>
<tr class='dbrow'>
<td>2021-05-13</td><td class='cGreen'>7.03</td><td class='cRed'>7.18</td><td class='cGreen'>6.74</td><td class='cGreen'>6.75</td><td class='cGreen'>-0.28</td><td class='cGreen'>-3.98</td><td>291,380</td><td>20,160</td><td>6.26</td><td>9.11</td></tr>
<tr class=''><td>2021-05-12</td><td class='cGreen'>6.40</td><td class='cRed'>7.36</td><td class='cGreen'>6.26</td><td class='cRed'>7.03</td><td class='cRed'>0.54</td><td class='cRed'>8.32</td><td>429,979</td><td>30,069</td><td>16.95</td><td>13.44</td></tr>
<tr class='dbrow'><td>2021-05-11</td><td class='cGreen'>6.53</td><td class='cGreen'>6.58</td><td class='cGreen'>6.25</td><td class='cGreen'>6.49</td><td class='cGreen'>-0.15</td><td class='cGreen'>-2.26</td><td>207,696</td><td>13,307</td><td>4.97</td><td>6.49</td></tr>
<tr class=''><td>2021-05-10</td><td class='cGreen'>6.58</td><td class='cRed'>7.06</td><td class='cGreen'>6.48</td><td class='cGreen'>6.64</td><td class='cGreen'>-0.12</td><td class='cGreen'>-1.78</td><td>275,209</td><td>18,406</td><td>8.58</td><td>8.60</td></tr>
<tr class='dbrow'><td>2021-05-07</td><td class='cRed'>7.10</td><td class='cRed'>7.15</td><td class='cGreen'>6.56</td><td class='cGreen'>6.76</td><td class='cGreen'>-0.25</td><td class='cGreen'>-3.57</td><td>319,005</td><td>21,693</td><td>8.42</td><td>9.97</td></tr>
<tr class=''><td>2021-05-06</td><td class='cRed'>6.73</td><td class='cRed'>7.35</td><td class='cGreen'>6.60</td><td class='cRed'>7.01</td><td class='cRed'>0.41</td><td class='cRed'>6.21</td><td>426,529</td><td>29,506</td><td>11.36</td><td>13.33</td></tr>
<tr class='dbrow'><td>2021-04-30</td><td class='cRed'>6.97</td><td class='cRed'>7.30</td><td class='cGreen'>6.27</td><td class='cGreen'>6.60</td><td class='cGreen'>-0.06</td><td class='cGreen'>-0.90</td><td>529,436</td><td>35,333</td><td>15.47</td><td>16.54</td></tr>
<tr class=''><td>2021-04-29</td><td class='cGreen'>5.50</td><td class='cRed'>6.66</td><td class='cGreen'>5.41</td><td class='cRed'>6.66</td><td class='cRed'>1.11</td><td class='cRed'>20.00</td><td>382,602</td><td>23,741</td><td>22.52</td><td>11.96</td></tr>
<tr class='dbrow'><td>2021-04-28</td><td class='cGreen'>5.64</td><td class='cRed'>5.86</td><td class='cGreen'>5.53</td><td class='cGreen'>5.55</td><td class='cGreen'>-0.27</td><td class='cGreen'>-4.64</td><td>196,804</td><td>11,068</td><td>5.67</td><td>6.15</td></tr>
<tr class=''><td>2021-04-27</td><td class='cGreen'>6.06</td><td class='cGreen'>6.14</td><td class='cGreen'>5.65</td><td class='cGreen'>5.82</td><td class='cGreen'>-0.44</td><td class='cGreen'>-7.03</td><td>286,392</td><td>16,754</td><td>7.83</td><td>8.95</td></tr>
<tr class='dbrow'><td>2021-04-26</td><td class='cGreen'>6.28</td><td class='cGreen'>6.48</td><td class='cGreen'>5.89</td><td class='cGreen'>6.26</td><td class='cGreen'>-0.39</td><td class='cGreen'>-5.86</td><td>397,512</td><td>24,395</td><td>8.87</td><td>12.42</td></tr>
<tr class=''><td>2021-04-23</td><td class='cRed'>6.38</td><td class='cRed'>7.20</td><td class='cRed'>6.18</td><td class='cRed'>6.65</td><td class='cRed'>0.60</td><td class='cRed'>9.92</td><td>593,691</td><td>40,159</td><td>16.86</td><td>18.55</td></tr>
<tr class='dbrow'><td>2021-04-22</td><td class='cGreen'>5.03</td><td class='cRed'>6.05</td><td class='cGreen'>5.03</td><td class='cRed'>6.05</td><td class='cRed'>1.01</td><td class='cRed'>20.04</td><td>349,405</td><td>19,949</td><td>20.24</td><td>10.92</td></tr>
<tr class=''><td>2021-04-21</td><td class='cRed'>5.15</td><td class='cRed'>5.15</td><td class='cGreen'>5.03</td><td class='cGreen'>5.04</td><td class='cGreen'>-0.08</td><td class='cGreen'>-1.56</td><td>28,485</td><td>1,444</td><td>2.34</td><td>0.89</td></tr>
<tr class='dbrow'><td>2021-04-20</td><td class='cGreen'>5.17</td><td class='cRed'>5.24</td><td class='cGreen'>5.12</td><td class='cGreen'>5.12</td><td class='cGreen'>-0.07</td><td class='cGreen'>-1.35</td><td>38,135</td><td>1,976</td><td>2.31</td><td>1.19</td></tr>
<tr class=''><td>2021-04-19</td><td class='cGreen'>5.15</td><td class='cRed'>5.25</td><td class='cGreen'>5.12</td><td class='cRed'>5.19</td><td class='cRed'>0.03</td><td class='cRed'>0.58</td><td>41,940</td><td>2,178</td><td>2.52</td><td>1.31</td></tr>
<tr class='dbrow'><td>2021-04-16</td><td class='cGreen'>4.97</td><td class='cRed'>5.19</td><td class='cGreen'>4.95</td><td class='cRed'>5.16</td><td class='cRed'>0.19</td><td class='cRed'>3.82</td><td>42,536</td><td>2,172</td><td>4.83</td><td>1.33</td></tr>
<tr class=''><td>2021-04-15</td><td class='cGreen'>5.04</td><td class='cGreen'>5.05</td><td class='cGreen'>4.94</td><td class='cGreen'>4.97</td><td class='cGreen'>-0.10</td><td class='cGreen'>-1.97</td><td>30,343</td><td>1,514</td><td>2.17</td><td>0.95</td></tr>
<tr class='dbrow'><td>2021-04-14</td><td class='cRed'>4.91</td><td class='cRed'>5.08</td><td class='cGreen'>4.87</td><td class='cRed'>5.07</td><td class='cRed'>0.18</td><td class='cRed'>3.68</td><td>31,385</td><td>1,559</td><td>4.29</td><td>0.98</td></tr>
<tr class=''><td>2021-04-13</td><td class='cRed'>5.04</td><td class='cRed'>5.05</td><td class='cGreen'>4.82</td><td class='cGreen'>4.89</td><td class='cGreen'>-0.11</td><td class='cGreen'>-2.20</td><td>34,050</td><td>1,669</td><td>4.60</td><td>1.06</td></tr>
<tr class='dbrow'><td>2021-04-12</td><td class='cGreen'>5.09</td><td class='cGreen'>5.12</td><td class='cGreen'>4.98</td><td class='cGreen'>5.00</td><td class='cGreen'>-0.12</td><td class='cGreen'>-2.34</td><td>37,899</td><td>1,901</td><td>2.73</td><td>1.18</td></tr>
<tr class=''><td>2021-04-09</td><td class='cRed'>5.15</td><td class='cRed'>5.24</td><td class='cRed'>5.11</td><td class='cRed'>5.12</td><td class='cRed'>0.05</td><td class='cRed'>0.99</td><td>47,592</td><td>2,458</td><td>2.56</td><td>1.49</td></tr>
<tr class='dbrow'><td>2021-04-08</td><td class='cRed'>5.29</td><td class='cRed'>5.29</td><td class='cGreen'>5.07</td><td class='cGreen'>5.07</td><td class='cGreen'>-0.18</td><td class='cGreen'>-3.43</td><td>54,769</td><td>2,815</td><td>4.19</td><td>1.71</td></tr>
<tr class=''><td>2021-04-07</td><td class='cGreen'>5.21</td><td class='cRed'>5.28</td><td class='cGreen'>5.19</td><td class='cRed'>5.25</td><td class='cRed'>0.04</td><td class='cRed'>0.77</td><td>31,196</td><td>1,632</td><td>1.73</td><td>0.97</td></tr>
<tr class='dbrow'><td>2021-04-06</td><td class='cRed'>5.19</td><td class='cRed'>5.26</td><td class='cGreen'>5.13</td><td class='cRed'>5.21</td><td class='cRed'>0.04</td><td class='cRed'>0.77</td><td>38,493</td><td>2,005</td><td>2.51</td><td>1.20</td></tr>
<tr class=''><td>2021-04-02</td><td class='cGreen'>5.06</td><td class='cRed'>5.18</td><td class='cGreen'>5.06</td><td class='cRed'>5.17</td><td class='cRed'>0.04</td><td class='cRed'>0.78</td><td>43,675</td><td>2,231</td><td>2.34</td><td>1.36</td></tr>
<tr class='dbrow'><td>2021-04-01</td><td class='cRed'>5.30</td><td class='cRed'>5.34</td><td class='cGreen'>5.05</td><td class='cGreen'>5.13</td><td class='cGreen'>-0.13</td><td class='cGreen'>-2.47</td><td>57,846</td><td>2,967</td><td>5.51</td><td>1.81</td></tr>
</table>
</div>
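Inspecting the page source shows where the data lives: in each tr tag under the table with class table_bg001 border_box limit_sale, which sits inside the div with class inner_box. The soup's find method can therefore locate the data.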
table = soup.find("div", class_="inner_box").find(
    "table", class_="table_bg001 border_box limit_sale")
The result holds the run of tr tags and still keeps the HTML markup.
dataList = table.find_all("tr")[1:]
The first tr under the table contains only th header cells and no data, so it is dropped.
<td>2021-05-14</td>
<td class='cRed'>6.94</td>
<td class='cRed'>8.10</td>
<td class='cRed'>6.78</td>
<td class='cRed'>8.10</td>
<td class='cRed'>1.35</td>
<td class='cRed'>20.00</td>
<td>489,575</td>
<td>36,666</td>
<td>19.56</td>
<td>15.30</td>
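Next, consider the format of each day's data, shown above: one tr holds eleven td cells, in the same column order as the table header.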
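The assignment lists regular expressions as the extraction technique, while the program itself uses BeautifulSoup. As a hedged alternative, the same cells could be pulled out with the standard library's re module. The names `ROW_RE`, `CELL_RE`, and `extract_rows` are hypothetical, and the sketch assumes the simple, non-nested row layout shown above.

```python
import re

# Non-greedy matches; re.S lets "." span the line breaks inside a row.
ROW_RE = re.compile(r"<tr[^>]*>(.*?)</tr>", re.S)
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.S)

SAMPLE = """
<tr class="dbrow"><th>日期</th><th>开盘价</th></tr>
<tr class=''><td>2021-05-14</td><td class='cRed'>6.94</td>
<td class='cRed'>8.10</td><td class='cRed'>6.78</td>
<td class='cRed'>8.10</td><td class='cRed'>1.35</td>
<td class='cRed'>20.00</td><td>489,575</td><td>36,666</td>
<td>19.56</td><td>15.30</td></tr>
"""


def extract_rows(html):
    rows = []
    for row_html in ROW_RE.findall(html):
        cells = [cell.strip() for cell in CELL_RE.findall(row_html)]
        # Header rows use <th>, so they yield no <td> cells and are skipped.
        if len(cells) >= 8:
            rows.append({
                "日期": cells[0],
                "开盘价": cells[1],
                "最高价": cells[2],
                "最低价": cells[3],
                "收盘价": cells[4],
                "成交量": cells[7],
            })
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```

Regular expressions work here only because the rows are flat and regular; an HTML parser is the more robust choice when the markup can nest or vary.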
with open("data.txt", "a+", encoding="utf-8") as f:
    for item in dataList:
        kv = {}
        if isinstance(item, bs4.element.Tag):
            tdList = item.find_all("td")
            kv["日期"] = tdList[0].text  # text between <td> and </td>
            kv["开盘价"] = tdList[1].text
            kv["最高价"] = tdList[2].text
            kv["最低价"] = tdList[3].text
            kv["收盘价"] = tdList[4].text
            kv["成交量"] = tdList[7].text
            month_data.append(kv)
            # One JSON record per line; keep the Chinese keys readable.
            f.write(json.dumps(kv, ensure_ascii=False) + "\n")
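Each day's data is saved as a dictionary, and all the data for one stock is collected in a list. tdList holds the cells of one day; each column is picked out by index, .text strips the surrounding td tags, and every record is also appended to a txt file.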
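For the text file to be readable again later, each record needs a delimiter. A minimal sketch, assuming a one-record-per-line (JSON Lines) layout; the helper names `append_records` and `load_records` and the temporary path are illustrative, not part of the original program.

```python
import json
import os
import tempfile


def append_records(path, records):
    # Append one JSON document per line.
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    # Read the file back, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


records = [
    {"日期": "2021-05-14", "开盘价": "6.94", "收盘价": "8.10"},
    {"日期": "2021-05-13", "开盘价": "7.03", "收盘价": "6.75"},
]
path = os.path.join(tempfile.mkdtemp(), "data.txt")
append_records(path, records)
loaded = load_records(path)
print(loaded == records)
```

Without the newline, consecutive json.dumps outputs run together into one string that json.loads cannot parse in a single call.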
import mysql.connector


def Storage(data):
    my_database = mysql.connector.connect(
        host="localhost",
        user="root",
        passwd="123456",
        auth_plugin='mysql_native_password'
    )
    cursor = my_database.cursor()
    sql_createDataBase = "create database if not exists stockData"
    cursor.execute(sql_createDataBase)
    sql_useDataBase = "USE stockData"
    cursor.execute(sql_useDataBase)
    sql_createTable = '''create table if not exists data(
        date DATE,
        opening_price float,
        closing_price float,
        highest float,
        lowest float)
    '''
    cursor.execute(sql_createTable)
    # Parameterized insert: the driver quotes and escapes each value.
    sql_Insert = "insert into data values (%s, %s, %s, %s, %s)"
    for item in data:
        cursor.execute(sql_Insert, (item['日期'], item['开盘价'], item['收盘价'],
                                    item['最高价'], item['最低价']))
    # Commit before closing so the inserted rows are persisted.
    my_database.commit()
    cursor.close()
    my_database.close()
The function first opens a database connection, then creates the database and the stock data table, whose columns match the crawled fields. It inserts the crawled records and finally closes the connection. Note that in Python, SQL statements are run through the cursor's execute function. For the insert, the statement reserves a placeholder for each value, and the values of each record are passed in the matching positions.
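Passing values as query parameters, rather than formatting them into the SQL string, leaves quoting and escaping to the driver. A minimal sketch of that pattern follows. It uses the standard library's sqlite3 as a stand-in for MySQL, an assumption made only so the example runs without a server; sqlite3 uses `?` placeholders where mysql.connector uses `%s`. The function name `store_rows` is hypothetical.

```python
import sqlite3


def store_rows(conn, rows):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS data (
               date TEXT,
               opening_price REAL,
               closing_price REAL,
               highest REAL,
               lowest REAL)"""
    )
    # executemany runs the same statement once per parameter tuple.
    conn.executemany(
        "INSERT INTO data VALUES (?, ?, ?, ?, ?)",
        [
            (r["日期"], float(r["开盘价"]), float(r["收盘价"]),
             float(r["最高价"]), float(r["最低价"]))
            for r in rows
        ],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, nothing on disk
store_rows(conn, [{"日期": "2021-05-14", "开盘价": "6.94", "最高价": "8.10",
                   "最低价": "6.78", "收盘价": "8.10"}])
stored = conn.execute("SELECT * FROM data").fetchall()
print(stored)
```

The tuple order must match the column order of the table (date, open, close, high, low), which differs from the order in which the fields appear on the web page.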
The K-line plotting function needs the data as separate per-column sequences, so a preprocessing step reshapes it here.
def data_Pretreatment(month_data, date, opening_price, closing_price, highest, lowest):
    # Append each field of every record to its own list (the lists are
    # supplied by the caller). Prices arrive as strings, so convert them
    # to float for plotting.
    for data in month_data:
        date.append(data.get("日期"))
        opening_price.append(float(data.get("开盘价")))
        closing_price.append(float(data.get("收盘价")))
        highest.append(float(data.get("最高价")))
        lowest.append(float(data.get("最低价")))
The function walks over the crawled records and collects each field into its own list.
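The same reshaping can be written without caller-supplied output lists. A minimal sketch, assuming the record keys used earlier in this report; `to_number` and `to_columns` are illustrative names, and stripping thousands separators is included because the page prints volumes such as 489,575.

```python
def to_number(text):
    # "489,575" -> 489575.0 ; "6.94" -> 6.94
    return float(text.replace(",", ""))


def to_columns(month_data):
    price_keys = ("开盘价", "收盘价", "最高价", "最低价")
    dates = [row["日期"] for row in month_data]
    prices = {key: [to_number(row[key]) for row in month_data]
              for key in price_keys}
    return dates, prices


sample = [
    {"日期": "2021-05-13", "开盘价": "7.03", "最高价": "7.18",
     "最低价": "6.74", "收盘价": "6.75"},
    {"日期": "2021-05-14", "开盘价": "6.94", "最高价": "8.10",
     "最低价": "6.78", "收盘价": "8.10"},
]
dates, prices = to_columns(sample)
print(dates)
print(prices["收盘价"])
```

Returning new lists avoids the situation where a second call appends to lists that already hold the previous stock's data.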
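# This function uses the pyecharts 1.x style API.
from pyecharts import options as opts
from pyecharts.charts import Bar


def draw(month_data, lowest):
    # Take the x-axis labels from the crawled records.
    date = [item["日期"] for item in month_data]
    bar1 = Bar()
    bar1.add_xaxis(date)
    bar1.add_yaxis("最低价", lowest)
    bar1.set_series_opts(
        # Hide the per-bar value labels.
        label_opts=opts.LabelOpts(is_show=False),
        # Mark the maximum and minimum bars.
        markpoint_opts=opts.MarkPointOpts(
            data=[opts.MarkPointItem(type_="max", name="max"),
                  opts.MarkPointItem(name="min", type_="min")]
        ),
        # Draw a horizontal line at the average.
        markline_opts=opts.MarkLineOpts(
            data=[opts.MarkLineItem(name="average", type_="average")]))
    bar1.set_global_opts(
        # Rotate the date labels so that all of them fit on the axis.
        xaxis_opts=opts.AxisOpts(
            axislabel_opts=opts.LabelOpts(rotate=-60, font_size=10),
        ),
        yaxis_opts=opts.AxisOpts(
            name="价格:(元/股)",
        ),
    )
    bar1.render("最低价.html")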
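This step uses pyecharts, an interactive charting library that renders the data to a static HTML file the user can still interact with. There are thirty data points, too many to label legibly on one page, so the x-axis labels are rotated by 60 degrees to make all of them fit on a single page.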
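# This function uses the older pyecharts 0.5.x API: from pyecharts import Kline
def draw_K(opening_price, closing_price, lowest, highest):
    kline = Kline("K线图")
    # Each item is [open, close, low, high], the order Kline expects.
    v1 = []
    size = len(opening_price)
    for i in range(0, size):
        tmp = [opening_price[i], closing_price[i], lowest[i], highest[i]]
        v1.append(tmp)
    print(v1)
    kline.add("日K",
              # Placeholder day labels, one per data point.
              ["2021/4/{}".format(i + 1) for i in range(size)],
              v1,
              mark_point=["max", "min"],
              # Horizontal data-zoom slider for scrolling through the data.
              is_datazoom_show=True,
              datazoom_orient="horizontal",
              mark_line_valuedim="close",
              )
    kline.render("K图.html")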
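The preprocessed data is passed straight in as the plotting function's four arguments. The same problem of too much data to display arises here. Because a K-line chart carries a lot of detail and cannot simply be shrunk, a horizontal scroll bar is enabled so that the complete chart can be viewed.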
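Each K-line item is a four-element list in the order open, close, low, high. A minimal sketch of assembling and sanity-checking those items before plotting; the function name `build_kline_rows` and the validation rule are assumptions added for illustration.

```python
def build_kline_rows(opening, closing, lowest, highest):
    if not (len(opening) == len(closing) == len(lowest) == len(highest)):
        raise ValueError("all four price lists must have the same length")
    rows = []
    for o, c, low, high in zip(opening, closing, lowest, highest):
        # A day's low cannot exceed its open or close,
        # and its high cannot be below them.
        if not (low <= min(o, c) and max(o, c) <= high):
            raise ValueError("inconsistent prices: %r" % ([o, c, low, high],))
        rows.append([o, c, low, high])
    return rows


kline_rows = build_kline_rows([6.94, 7.03], [8.10, 6.75],
                              [6.78, 6.74], [8.10, 7.18])
print(kline_rows)
```

A check like this catches columns passed in the wrong order, which would otherwise still render but produce a misleading chart.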
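This is another data-processing step. The GUI displays the data as an Excel-style table, and the next chart (the stock analysis chart) needs the stock data as a DataFrame; this function serves both purposes.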
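import pandas as pd


def store_in_dataframe(month_data):
    my_list = []
    for item in month_data:
        tmp = list(item.values())
        my_list.append(tmp)
    # Column names follow the key order built in getData:
    # date, open, high, low, close, volume.
    my_dataframe = pd.DataFrame(
        my_list,
        columns=["datetime", "open", "high", "low", "close", "trade_sum"])
    # The crawled values are strings; convert them so they can be plotted.
    for col in ["open", "high", "low", "close"]:
        my_dataframe[col] = my_dataframe[col].astype(float)
    my_dataframe["trade_sum"] = (
        my_dataframe["trade_sum"].str.replace(",", "").astype(int))
    print(my_dataframe)
    my_dataframe.to_excel("数据.xlsx")
    return my_dataframe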
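This step combines the K-line chart with further analysis of the stock data. With pyecharts it can:
1. Show several curves on one chart (the Overlap feature); the user can hide any curve that is not needed.
2. Show a bar chart of each day's trading volume beneath the K-line chart.
3. Mark suggested buy and sell points for the user's reference.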
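The report does not say which indicator curves are overlaid on the K-line chart. As one hypothetical example of such a curve, a simple moving average of the closing price could be computed as below. The choice of indicator, the function name `moving_average`, and the window size are all assumptions.

```python
def moving_average(values, window):
    """Simple moving average; positions before a full window are None."""
    if window <= 0:
        raise ValueError("window must be positive")
    result = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        result.append(running / window if i >= window - 1 else None)
    return result


closes = [1, 2, 3, 4, 5]
ma3 = moving_average(closes, 3)
print(ma3)
```

A column computed this way could be added to the DataFrame and named in the list of indicators to plot.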
# This function uses the older pyecharts 0.5.x API:
# from pyecharts import Kline, Line, Bar, EffectScatter, Overlap
def back_testing_plot(table_name, indicator_name_list):
    # Data preparation.
    da = pd.DataFrame(data=table_name)
    date = da["datetime"].apply(lambda x: str(x)).tolist()
    # One [open, close, low, high] item per day.
    k_plot_value = da.apply(lambda record: [record['open'],
                                            record['close'],
                                            record['low'],
                                            record['high']],
                            axis=1).tolist()
    # K-line chart.
    kline = Kline()
    kline.add("Back_testing Result", date, k_plot_value)
    # One line per indicator column.
    indicator_lines = Line()
    for indicator_name in indicator_name_list:
        indicator_lines.add(indicator_name,
                            date,
                            da[indicator_name].tolist(),
                            mark_point=["max", "min"],
                            )
    # Trading volume bar chart; the enlarged y-axis maximum keeps the bars
    # in the lower part of the chart.
    bar = Bar()
    bar.add("trade_sum", date, da["trade_sum"].tolist(),
            tooltip_tragger="axis",
            is_legend_show=False,
            is_yaxis_show=False,
            yaxis_max=5 * max(da["trade_sum"]),
            )
    # Buy and sell markers at fixed example positions (rows 10 and 18);
    # they are not computed trading signals.
    v1 = date[10]
    v2 = da['high'].iloc[10]
    es = EffectScatter("buy")
    es.add("buy", [v1], [v2])
    v1 = date[18]
    v2 = da['high'].iloc[18]
    es.add("sell", [v1], [v2], symbol="pin", )
    # Stack all charts on the same axes.
    overlap = Overlap()
    overlap.add(kline)
    overlap.add(indicator_lines, )
    overlap.add(bar)
    overlap.add(es)
    overlap.render(path='高级图.html')
import wx
import wx.grid


class MyFrame(wx.Frame):
    data = []
    column_names = []
    stack_Code = 0

    def __init__(self, data, column_names):
        super().__init__(parent=None, title="股票数据显示界面", size=(600, 600))
        self.data = data
        self.column_names = column_names
        self.Centre()
        panel = wx.Panel(parent=self)
        # Input box for the stock code.
        self.number = wx.TextCtrl(panel, pos=(450, 370))
        # query_button = wx.Button(parent=panel, id=1, label='查询', pos=(450, 400))
        post = wx.Button(parent=panel, id=2, label='更新', pos=(450, 435))
        show = wx.Button(parent=panel, id=3, label='查看表格', pos=(450, 470))
        # self.Bind(wx.EVT_BUTTON, self.on_click, query_button)
        # Both buttons share one handler, which dispatches on the button id.
        self.Bind(wx.EVT_BUTTON, self.on_click, post)
        self.Bind(wx.EVT_BUTTON, self.on_click, show)

    # Build the table.
    def generate_xlsx(self):
        self.grid = self.CreateGrid(self)
        self.Bind(wx.grid.EVT_GRID_LABEL_LEFT_CLICK, self.OnLabelLeftClick)

    def on_click(self, event):
        event_id = event.GetId()
        print(event_id)
        if event_id == 1:
            print("查询K图")
        elif event_id == 2:
            # Re-crawl the data for the stock code typed into the input box.
            self.stack_Code = self.number.GetValue()
            print(self.number.GetValue())
            update_xlsx(self, self.stack_Code)
        elif event_id == 3:
            self.generate_xlsx()

    def OnLabelLeftClick(self, event):
        print("RowIdx:{0}".format(event.GetRow()))
        print("ColIdx:{0}".format(event.GetCol()))
        print(self.data[event.GetRow()])
        event.Skip()

    def CreateGrid(self, parent):
        grid = wx.grid.Grid(parent)
        grid.CreateGrid(len(self.data), len(self.data[0]))
        for col in range(len(self.column_names)):
            grid.SetColLabelValue(col, self.column_names[col])
        for row in range(len(self.data)):
            for col in range(len(self.data[row])):
                grid.SetCellValue(row, col, self.data[row][col])
        # Size rows and columns automatically to fit their contents.
        grid.AutoSize()
        return grid
class App(wx.App):
    data = []
    column_names = []

    def show(self):
        frame = MyFrame(self.data, self.column_names)
        frame.Show()
        return True


def update_xlsx(app, stack_code):
    # Crawl the given stock again and replace the table data.
    newData = getHtml(stack_code)
    month_data = getData(newData)
    my_list = []
    for item in month_data:
        tmp = list(item.values())
        my_list.append(tmp)
    app.data = my_list
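def update(month_data, date, opening_price, closing_price, highest, lowest):
    data_Pretreatment(month_data, date, opening_price, closing_price, highest, lowest)
    draw_K(opening_price, closing_price, lowest, highest)
    my_dataframe = store_in_dataframe(month_data)
    # Iterating a DataFrame yields its column names, so passing the frame as
    # the indicator list would plot every column, including the dates.
    # Plot the four price columns as indicator lines instead.
    back_testing_plot(my_dataframe, ["open", "close", "low", "high"])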
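Details:
Two buttons are added to the interface's panel and bound to the on_click handler. When a click on a button with a given id is detected, the matching update function runs and redraws the K-line chart and the table.
Features:
The interface has one input box and two buttons. The user can view the stock's Excel data directly in the interface and can open any previously drawn chart through the static web pages saved in the file directory.
When the user enters a new stock code and clicks Update (更新), the stock data is crawled again, and the data in the Excel table, the K-line chart, and the stock analysis chart is updated.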
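The id-based handling described above can be illustrated without a GUI toolkit. This is a plain-Python sketch of the dispatch idea only, not wxPython code; the class `ButtonDispatcher` and its methods are hypothetical.

```python
class ButtonDispatcher:
    """Map button ids to handler functions."""

    def __init__(self):
        self._handlers = {}

    def bind(self, button_id, handler):
        self._handlers[button_id] = handler

    def on_click(self, button_id):
        # Return True if a handler ran, False if the id is unknown.
        handler = self._handlers.get(button_id)
        if handler is None:
            return False
        handler()
        return True


log = []
dispatcher = ButtonDispatcher()
dispatcher.bind(2, lambda: log.append("update"))
dispatcher.bind(3, lambda: log.append("show table"))
dispatcher.on_click(3)
dispatcher.on_click(1)  # no handler bound to id 1, so nothing happens
print(log)
```

A lookup table replaces the chain of if/elif branches, so adding a button means adding one bind call rather than editing the handler.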
def main():
    htmlText = getHtml(str(600975))
    month_data = getData(htmlText)
    # Dates are formatted YYYY-MM-DD, so sorting the strings sorts by date.
    month_data.sort(key=lambda k: k['日期'])
    print(month_data)
    Storage(month_data)
    date = []
    opening_price = []
    closing_price = []
    highest = []
    lowest = []
    # update() takes exactly these six arguments: it preprocesses the data
    # and draws the K-line chart and the analysis chart.
    update(month_data, date, opening_price, closing_price, highest, lowest)
    app = App()
    my_list = []
    for item in month_data:
        tmp = list(item.values())
        my_list.append(tmp)
    app.data = my_list
    # Same order as the keys built in getData.
    app.column_names = ["datetime", 'open', 'high', 'low', 'close', "trade_sum"]
    print(app.data)
    app.show()
    app.MainLoop()


if __name__ == '__main__':
    main()
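By default, the interface shows the data for 国家电力 (State Power).
Entering a new stock code updates the data in the table and also updates the K-line chart.
Three different sets of data
The K-line chart could not be embedded in the interface, so viewing it is not convenient enough.
1. Learned basic HTML and how to use the browser's developer tools to find the network data that is needed.
2. Practiced embedding and executing SQL statements from a high-level language.
3. Understood how widely the DataFrame format is used in library calls, and became proficient in data preprocessing and format conversion.
4. When the data cannot all be displayed at once, learned to adjust how it is displayed (compressed or scrolling) according to the user's needs and the characteristics of the data.
5. Implemented the interaction between the front-end interface and the back-end data, keeping the displayed data up to date.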