Yahoo Finance
Pandas can fetch Yahoo Finance data directly, which is very convenient. Note that, per the official deprecation notice, `from pandas.io import data, wb` must be replaced with `from pandas_datareader import data, wb`.
See the Pandas for finance documentation.
We use the Shanghai Composite Index, ticker 000001.SS.
import pandas as pd
import numpy as np
from pandas_datareader import data, wb # requires: pip install pandas_datareader
import datetime
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline
Downloading the data directly
The site offers CSV downloads: download a file, then read it with `sh_table = pd.read_csv('sh_table.csv')`.
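When reading the downloaded CSV, passing `index_col` and `parse_dates` produces a DatetimeIndex right away, which the time-series operations later in this post rely on. A minimal sketch, with an inline CSV standing in for the downloaded `sh_table.csv`:

```python
import io
import pandas as pd

# Inline CSV standing in for the downloaded file.
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
2010-01-04,3243.76,3243.76,3243.76,3243.76,0,3243.76
2010-01-05,3282.18,3282.18,3282.18,3282.18,0,3282.18
"""

# parse_dates=True parses the index column into datetime64 values.
sh_table = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)
print(sh_table.index.dtype)  # datetime64[ns]
```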
Fetching data with DataReader
# define the date range to fetch
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2016, 5, 20)
sh = data.DataReader("000001.SS", 'yahoo', start, end)
sh.head(3) # data fetched successfully
| Date | Open | High | Low | Close | Volume | Adj Close |
| --- | --- | --- | --- | --- | --- | --- |
| 2010-01-04 | 3243.76 | 3243.76 | 3243.76 | 3243.76 | 0 | 3243.76 |
| 2010-01-05 | 3282.18 | 3282.18 | 3282.18 | 3282.18 | 0 | 3282.18 |
| 2010-01-06 | 3254.22 | 3254.22 | 3254.22 | 3254.22 | 0 | 3254.22 |
sh.describe() # overall summary of the data
| | Open | High | Low | Close | Volume | Adj Close |
| --- | --- | --- | --- | --- | --- | --- |
| count | 1558.000000 | 1558.000000 | 1558.000000 | 1558.000000 | 1558.0 | 1558.000000 |
| mean | 2659.427452 | 2659.427452 | 2659.427452 | 2659.427452 | 0.0 | 2659.427452 |
| std | 595.170673 | 595.170673 | 595.170673 | 595.170673 | 0.0 | 595.170673 |
| min | 1950.010000 | 1950.010000 | 1950.010000 | 1950.010000 | 0.0 | 1950.010000 |
| 25% | 2199.832500 | 2199.832500 | 2199.832500 | 2199.832500 | 0.0 | 2199.832500 |
| 50% | 2492.600000 | 2492.600000 | 2492.600000 | 2492.600000 | 0.0 | 2492.600000 |
| 75% | 2968.977500 | 2968.977500 | 2968.977500 | 2968.977500 | 0.0 | 2968.977500 |
| max | 5166.350000 | 5166.350000 | 5166.350000 | 5166.350000 | 0.0 | 5166.350000 |
sh['Close'].plot(); # plot the closing-price trend
(figure: closing-price trend of the index)
Reading and writing data: pd.read_csv and to_csv
Reading data from a file is a very common operation.
sh.to_csv('sh.csv',header=False)
names = ['Date','Open','High','Low','Close','Volume','Adj Close']
sh1 = pd.read_csv('sh.csv',names=names,index_col='Date')
sh1.tail(2)
| Date | Open | High | Low | Close | Volume | Adj Close |
| --- | --- | --- | --- | --- | --- | --- |
| 2016-05-19 | 2806.91 | 2806.91 | 2806.91 | 2806.91 | 0 | 2806.91 |
| 2016-05-20 | 2825.48 | 2825.48 | 2825.48 | 2825.48 | 0 | 2825.48 |
Cleaning the data
# drop the Volume column, since all its values are 0
sh = sh.drop('Volume',axis=1)
sh.head(2) # the Volume column is gone
| Date | Open | High | Low | Close | Adj Close |
| --- | --- | --- | --- | --- | --- |
| 2010-01-04 | 3243.76 | 3243.76 | 3243.76 | 3243.76 | 3243.76 |
| 2010-01-05 | 3282.18 | 3282.18 | 3282.18 | 3282.18 | 3282.18 |
Checking for missing values
Sometimes values are missing or the data is incomplete, so it is worth checking.
# detect missing values with the isnull function
pd.isnull(sh).head() # or sh.isnull()
| Date | Open | High | Low | Close | Adj Close |
| --- | --- | --- | --- | --- | --- |
| 2010-01-04 | False | False | False | False | False |
| 2010-01-05 | False | False | False | False | False |
| 2010-01-06 | False | False | False | False | False |
| 2010-01-07 | False | False | False | False | False |
| 2010-01-08 | False | False | False | False | False |
This returns a boolean DataFrame, though; I just want to know whether any value is missing at all, without eyeballing the output. Use any() for that; if there are missing values, sum() counts how many.
# see http://stackoverflow.com/questions/29530232/python-pandas-check-if-any-value-is-nan-in-dataframe
sh.isnull().values.any() # the data is complete
False
sh.isnull().values.sum() # no missing values
0
sh.head()
| Date | Open | High | Low | Close | Adj Close |
| --- | --- | --- | --- | --- | --- |
| 2010-01-04 | 3243.76 | 3243.76 | 3243.76 | 3243.76 | 3243.76 |
| 2010-01-05 | 3282.18 | 3282.18 | 3282.18 | 3282.18 | 3282.18 |
| 2010-01-06 | 3254.22 | 3254.22 | 3254.22 | 3254.22 | 3254.22 |
| 2010-01-07 | 3192.78 | 3192.78 | 3192.78 | 3192.78 | 3192.78 |
| 2010-01-08 | 3196.00 | 3196.00 | 3196.00 | 3196.00 | 3196.00 |
Filling missing values
Here we set a few positions to np.nan and then fill them back in. This has no real meaning; it is just practice.
Set the NaNs as follows:
sh.iloc[0,:] = np.nan
sh.iloc[[1,3],1] = np.nan
sh.iloc[2,2] = np.nan
sh.iloc[3,3] = np.nan
sh.head(4)
| Date | Open | High | Low | Close | Adj Close |
| --- | --- | --- | --- | --- | --- |
| 2010-01-04 | NaN | NaN | NaN | NaN | NaN |
| 2010-01-05 | 3282.18 | NaN | 3282.18 | 3282.18 | 3282.18 |
| 2010-01-06 | 3254.22 | 3254.22 | NaN | 3254.22 | 3254.22 |
| 2010-01-07 | 3192.78 | NaN | 3192.78 | NaN | 3192.78 |
# check the NaN count per column
def num_missing(x):
    return x.isnull().sum()
print("Missing values count per column:")
print(sh.apply(num_missing, axis=0))
Missing values count per column:
Open 1
High 3
Low 2
Close 2
Adj Close 1
dtype: int64
dropna
Key parameters:
- `how='all'`: drop a row only when all of its values are NaN (with `axis=1`, this applies to columns);
- `how='any'` (the default): drop every row or column that contains a NaN;
- `inplace=True`: operate in place and return nothing;
- `inplace=False` (the default): return the result of the drop without modifying the original DataFrame.
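A minimal sketch of these parameters on a toy frame (the names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1.0, np.nan],
                   'b': [np.nan, np.nan, 3.0]})

all_dropped = df.dropna(how='all')  # drops only row 0, where every value is NaN
any_dropped = df.dropna(how='any')  # drops every row that contains a NaN
print(len(df), len(all_dropped), len(any_dropped))  # 3 2 0
```

Note that `df` itself is unchanged, because `inplace` defaults to False.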
sh.dropna(how='all',inplace=True);
sh.head(3)
| Date | Open | High | Low | Close | Adj Close |
| --- | --- | --- | --- | --- | --- |
| 2010-01-05 | 3282.18 | NaN | 3282.18 | 3282.18 | 3282.18 |
| 2010-01-06 | 3254.22 | 3254.22 | NaN | 3254.22 | 3254.22 |
| 2010-01-07 | 3192.78 | NaN | 3192.78 | NaN | 3192.78 |
Filling with fillna
sh['High'].fillna(sh['High'].mean(),inplace=True) # fill with the column mean
sh['Low'].fillna(method='ffill',inplace=True) # forward fill: use the previous row's value
sh['Close'].fillna(method='bfill',inplace=True) # backward fill: use the next row's value
sh.head()
| Date | Open | High | Low | Close | Adj Close |
| --- | --- | --- | --- | --- | --- |
| 2010-01-05 | 3282.18 | 2658.308199 | 3282.18 | 3282.18 | 3282.18 |
| 2010-01-06 | 3254.22 | 3254.220000 | 3282.18 | 3254.22 | 3254.22 |
| 2010-01-07 | 3192.78 | 2658.308199 | 3192.78 | 3196.00 | 3192.78 |
| 2010-01-08 | 3196.00 | 3196.000000 | 3196.00 | 3196.00 | 3196.00 |
| 2010-01-11 | 3212.75 | 3212.750000 | 3212.75 | 3212.75 | 3212.75 |
sh.isnull().values.sum()
0
Computing the daily price change
The daily change is the day's price minus the previous day's close. We add a Change column holding the difference between each day's Close and the previous day's. Note that the data has gaps: some dates have no record.
change = sh.Close.diff()
change.fillna(change.mean(),inplace=True)
sh['Change'] = change
sh.head(3)
| Date | Open | High | Low | Close | Adj Close | Change |
| --- | --- | --- | --- | --- | --- | --- |
| 2010-01-05 | 3282.18 | 2658.308199 | 3282.18 | 3282.18 | 3282.18 | -0.293509 |
| 2010-01-06 | 3254.22 | 3254.220000 | 3282.18 | 3254.22 | 3254.22 | -27.960000 |
| 2010-01-07 | 3192.78 | 2658.308199 | 3192.78 | 3196.00 | 3192.78 | -58.220000 |
# sort by Change; sh itself is unchanged (pass `inplace=True` if needed)
sh.sort_values(by='Change',ascending=False).head(3)
| Date | Open | High | Low | Close | Adj Close | Change |
| --- | --- | --- | --- | --- | --- | --- |
| 2015-10-12 | 3287.66 | 3287.66 | 3287.66 | 3287.66 | 3287.66 | 234.88 |
| 2015-06-30 | 4277.22 | 4277.22 | 4277.22 | 4277.22 | 4277.22 | 224.19 |
| 2015-06-01 | 4828.74 | 4828.74 | 4828.74 | 4828.74 | 4828.74 | 217.00 |
Computing the percentage change
sh.head()
| Date | Open | High | Low | Close | Adj Close | Change |
| --- | --- | --- | --- | --- | --- | --- |
| 2010-01-05 | 3282.18 | 2658.308199 | 3282.18 | 3282.18 | 3282.18 | -0.293509 |
| 2010-01-06 | 3254.22 | 3254.220000 | 3282.18 | 3254.22 | 3254.22 | -27.960000 |
| 2010-01-07 | 3192.78 | 2658.308199 | 3192.78 | 3196.00 | 3192.78 | -58.220000 |
| 2010-01-08 | 3196.00 | 3196.000000 | 3196.00 | 3196.00 | 3196.00 | 0.000000 |
| 2010-01-11 | 3212.75 | 3212.750000 | 3212.75 | 3212.75 | 3212.75 | 16.750000 |
# offset by one row with the shift method
# sh['pct_change'] = ((sh['Change'] - sh['Change'].shift(1)) / sh['Change'].shift(1))
# or use the pct_change method
sh['pct_change'] = sh.Change.pct_change()
sh.iloc[5:9]
| Date | Open | High | Low | Close | Adj Close | Change | pct_change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2010-01-12 | 3273.97 | 3273.97 | 3273.97 | 3273.97 | 3273.97 | 61.22 | 2.654925 |
| 2010-01-13 | 3172.66 | 3172.66 | 3172.66 | 3172.66 | 3172.66 | -101.31 | -2.654851 |
| 2010-01-14 | 3215.55 | 3215.55 | 3215.55 | 3215.55 | 3215.55 | 42.89 | -1.423354 |
| 2010-01-15 | 3224.15 | 3224.15 | 3224.15 | 3224.15 | 3224.15 | 8.60 | -0.799487 |
Apply and Applymap
Use applymap to format the data, keeping two decimal places throughout. This is an element-wise operation.
sh.head(2)
| Date | Open | High | Low | Close | Adj Close | Change | pct_change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2010-01-05 | 3282.18 | 2658.308199 | 3282.18 | 3282.18 | 3282.18 | -0.293509 | NaN |
| 2010-01-06 | 3254.22 | 3254.220000 | 3282.18 | 3254.22 | 3254.22 | -27.960000 | 94.261134 |
fmt = lambda x: '%.2f' % x  # avoid shadowing the built-in format
sh = sh.applymap(fmt)
sh.head(2)
| Date | Open | High | Low | Close | Adj Close | Change | pct_change |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2010-01-05 | 3282.18 | 2658.31 | 3282.18 | 3282.18 | 3282.18 | -0.29 | nan |
| 2010-01-06 | 3254.22 | 3254.22 | 3282.18 | 3254.22 | 3254.22 | -27.96 | 94.26 |
sh = sh.applymap(float) # convert the values back to float
sh.iloc[1] # .ix is deprecated; use .iloc
Open 3254.22
High 3254.22
Low 3282.18
Close 3254.22
Adj Close 3254.22
Change -27.96
pct_change 94.26
Name: 2010-01-06 00:00:00, dtype: float64
Using apply to find each column's extremes and range
def f(x):
    return pd.Series([x.min(),x.max(),x.max()-x.min()],index=['min','max','range'])
sh.apply(f)
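The notebook does not show the output of sh.apply(f); here is the same pattern on a toy frame:

```python
import pandas as pd

def f(x):
    return pd.Series([x.min(), x.max(), x.max() - x.min()],
                     index=['min', 'max', 'range'])

df = pd.DataFrame({'Open': [1.0, 3.0], 'Close': [2.0, 5.0]})
stats = df.apply(f)  # applied column by column (axis=0)
print(stats.loc['range'].tolist())  # [2.0, 3.0]
```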
Group by Month
By hand
# group by month
sh['group_index'] = sh.index.map(lambda x: 100*x.year + x.month)
sh.head(2).append(sh.tail(2)) # inspect group_index: as expected
| Date | Open | High | Low | Close | Adj Close | Change | pct_change | group_index |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2010-01-05 | 3282.18 | 2658.31 | 3282.18 | 3282.18 | 3282.18 | -0.29 | NaN | 201001 |
| 2010-01-06 | 3254.22 | 3254.22 | 3282.18 | 3254.22 | 3254.22 | -27.96 | 94.26 | 201001 |
| 2016-05-19 | 2806.91 | 2806.91 | 2806.91 | 2806.91 | 2806.91 | -0.60 | -0.98 | 201605 |
| 2016-05-20 | 2825.48 | 2825.48 | 2825.48 | 2825.48 | 2825.48 | 18.57 | -31.95 | 201605 |
sh.groupby('group_index') # returns a GroupBy object
# aggregate different columns with different functions
def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped = sh.groupby('group_index').agg({
'Open' : peak_to_peak,
'High' : 'max',
'Low' : 'min',
'Change' : 'mean'
})
grouped.head(3).append(grouped.tail(3))
| group_index | High | Open | Low | Change |
| --- | --- | --- | --- | --- |
| 201001 | 3273.97 | 295.57 | 2986.61 | -15.430526 |
| 201002 | 3060.62 | 125.91 | 2934.71 | 3.132500 |
| 201003 | 3128.47 | 151.53 | 2976.94 | 2.485217 |
| 201603 | 3018.80 | 285.63 | 2733.17 | 13.736522 |
| 201604 | 3082.36 | 144.04 | 2938.32 | -3.452632 |
| 201605 | 2997.84 | 190.93 | 2806.91 | -8.060000 |
Pandas time-series features make this more convenient
sh.index # a DatetimeIndex offers many convenient operations
DatetimeIndex(['2010-01-05', '2010-01-06', '2010-01-07', '2010-01-08',
'2010-01-11', '2010-01-12', '2010-01-13', '2010-01-14',
'2010-01-15', '2010-01-18',
...
'2016-05-09', '2016-05-10', '2016-05-11', '2016-05-12',
'2016-05-13', '2016-05-16', '2016-05-17', '2016-05-18',
'2016-05-19', '2016-05-20'],
dtype='datetime64[ns]', name=u'Date', length=1557, freq=None)
grouped_1 = sh.groupby(pd.Grouper(freq="M")).agg({  # pd.TimeGrouper was removed; use pd.Grouper
'Open' : peak_to_peak,
'High' : 'max',
'Low' : 'min',
'Change' : 'mean'
})
grouped_1.head(3).append(grouped_1.tail(3))
| Date | High | Open | Low | Change |
| --- | --- | --- | --- | --- |
| 2010-01-31 | 3273.97 | 295.57 | 2986.61 | -15.430526 |
| 2010-02-28 | 3060.62 | 125.91 | 2934.71 | 3.132500 |
| 2010-03-31 | 3128.47 | 151.53 | 2976.94 | 2.485217 |
| 2016-03-31 | 3018.80 | 285.63 | 2733.17 | 13.736522 |
| 2016-04-30 | 3082.36 | 144.04 | 2938.32 | -3.452632 |
| 2016-05-31 | 2997.84 | 190.93 | 2806.91 | -8.060000 |
Or resample for resampling
Both downsampling and upsampling are possible.
# the resample('M',how='mean') syntax is deprecated; use .mean() or .apply() instead
# a dict can be passed to apply
grouped_2 = sh[['Open','High','Low','Change']].resample('M').apply({
'Open' : peak_to_peak,
'High' : 'max',
'Low' : 'min',
'Change' : 'mean'
})
grouped_2.head(3).append(grouped_2.tail(3))
| Date | High | Open | Low | Change |
| --- | --- | --- | --- | --- |
| 2010-01-31 | 3273.97 | 295.57 | 2986.61 | -15.430526 |
| 2010-02-28 | 3060.62 | 125.91 | 2934.71 | 3.132500 |
| 2010-03-31 | 3128.47 | 151.53 | 2976.94 | 2.485217 |
| 2016-03-31 | 3018.80 | 285.63 | 2733.17 | 13.736522 |
| 2016-04-30 | 3082.36 | 144.04 | 2938.32 | -3.452632 |
| 2016-05-31 | 2997.84 | 190.93 | 2806.91 | -8.060000 |
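Upsampling, mentioned above but not shown, goes through the same resample interface. A toy sketch (not the sh frame):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0],
              index=pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-04']))

monthly = s.resample('M').mean()  # downsample: one value per month-end
daily = s.resample('D').ffill()   # upsample: the missing 2016-01-03 is forward-filled
print(monthly.iloc[0], len(daily))  # 2.0 4
```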
sh[['Open']].plot();
plt.show()
(figure: Open price plot)