To better understand these basic operations, we will work with a real stock dataset. File I/O is covered in detail later; for now we only need the read API.
# Read the file
data = pd.read_csv("./stock_day/stock_day.csv")
# Drop some columns to simplify the data before the operations below
data = data.drop(["ma5", "ma10", "ma20", "v_ma5", "v_ma10", "v_ma20"], axis=1)
In the NumPy section we covered selection by integer index and by slice; pandas supports similar operations, and additionally lets you select by column name, row label, or a combination of both.
Get the 'open' value for '2018-02-27':
# Index directly by name (column first, then row)
data['open']['2018-02-27']
23.53
# Unsupported operations
# Error: [] looks up columns first, and '2018-02-27' is not a column
data[['2018-02-27']]['open']
# Error: [] does not accept a (row, col) tuple
data[:1, :2]
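The rules above can be illustrated on a tiny stand-in DataFrame (made-up values, not the real stock file):

```python
import pandas as pd

# A small stand-in for the stock data: string dates as the row index
df = pd.DataFrame(
    {"open": [23.53, 22.80], "close": [24.16, 23.53]},
    index=["2018-02-27", "2018-02-26"],
)

# Plain [] indexing selects a column first, then a row label
print(df["open"]["2018-02-27"])  # 23.53

# df["2018-02-27"]["open"] would raise KeyError: [] looks up columns first
# df[:1, :2] would raise TypeError: [] does not accept a (row, col) tuple
```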
Use loc to get 'open' for the rows from '2018-02-27' through '2018-02-22':
# loc selects by row and column labels only
data.loc['2018-02-27':'2018-02-22', 'open']
Result:
2018-02-27 23.53
2018-02-26 22.80
2018-02-23 22.88
Name: open, dtype: float64
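Note that a label slice with loc includes both endpoints, and it follows the current index order. A minimal sketch with made-up values (the stock index is descending, so the slice runs from the later date to the earlier one):

```python
import pandas as pd

df = pd.DataFrame(
    {"open": [23.53, 22.80, 22.88]},
    index=["2018-02-27", "2018-02-26", "2018-02-23"],  # descending, like the stock data
)

# Unlike positional slices, label slices include BOTH endpoints
result = df.loc["2018-02-27":"2018-02-23", "open"]
print(len(result))  # 3 — '2018-02-23' is included
```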
Use iloc to select by integer position:
# iloc selects rows and columns by integer position
# Get the first 100 rows and the first four columns
data.iloc[0:100, 0:4].head()
Result:
open high close low
2018-02-27 23.53 25.88 24.16 23.53
2018-02-26 22.80 23.78 23.53 22.80
2018-02-23 22.88 23.37 22.82 22.71
Get the first 4 rows and the four columns ['open', 'close', 'high', 'low']:
# ix used to allow mixing positions and labels, but it is deprecated
# (and removed in pandas 1.0) — avoid it in new code
# data.ix[0:4, ['open', 'close', 'high', 'low']]
# Use loc and iloc instead:
data.loc[data.index[0:4], ['open', 'close', 'high', 'low']]
data.iloc[0:4, data.columns.get_indexer(['open', 'close', 'high', 'low'])]
Both queries return the same result:
open close high low
2018-02-27 23.53 24.16 25.88 23.53
2018-02-26 22.80 23.53 23.78 22.80
2018-02-23 22.88 22.82 23.37 22.71
2018-02-22 22.25 22.28 22.76 22.02
Reassign the entire close column of the DataFrame to 1:
# Modify the values in place
data['close'] = 1
# Attribute syntax also works for an existing column
data.close = 1
Note: this assigns the scalar to the whole column.
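A minimal sketch of this broadcast assignment, on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"close": [24.16, 23.53], "open": [23.53, 22.80]})

# Assigning a scalar to a column broadcasts it to every row
df["close"] = 1
print(df["close"].tolist())  # [1, 1]

# Bracket syntax is generally safer than df.close = 1:
# it also works for creating a brand-new column
df["flag"] = 0
```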
Sorting comes in two forms: sorting by values and sorting by index.
# Sort by the daily percent change; ascending controls the direction
data.sort_values(by='p_change', ascending=False).head()
open high close low volume price_change p_change turnover
2015-08-28 15.40 16.46 16.46 15.00 117827.60 1.50 10.03 4.03
2015-05-21 27.50 28.22 28.22 26.50 121190.11 2.57 10.02 4.15
2016-12-22 18.50 20.42 20.42 18.45 150470.83 1.86 10.02 3.77
2015-08-04 16.20 17.35 17.35 15.80 94292.63 1.58 10.02 3.23
2016-07-07 18.66 18.66 18.66 18.41 48756.55 1.70 10.02 1.67
# Sort by multiple keys: ties on the first key are broken by the second
data.sort_values(by=['open', 'high'], ascending=False).head()
open high close low volume price_change p_change turnover
2015-06-15 34.99 34.99 31.69 31.69 199369.53 -3.52 -10.00 6.82
2015-06-12 34.69 35.98 35.21 34.01 159825.88 0.82 2.38 5.47
2015-06-10 34.10 36.35 33.85 32.23 269033.12 0.51 1.53 9.21
2017-11-01 33.85 34.34 33.83 33.10 232325.30 -0.61 -1.77 5.81
2015-06-11 33.17 34.98 34.39 32.51 173075.73 0.54 1.59 5.92
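A minimal sketch of multi-key sorting on made-up values, showing how the second key breaks ties on the first:

```python
import pandas as pd

df = pd.DataFrame({"open": [22.0, 22.0, 21.0],
                   "high": [23.0, 22.5, 21.5]})

# Rows with equal 'open' are ordered by 'high'
result = df.sort_values(by=["open", "high"], ascending=False)
print(result["high"].tolist())  # [23.0, 22.5, 21.5]
```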
The date index of this stock was originally in descending order; sorting the index puts it in ascending order.
# Sort by index
data.sort_index()
open high close low volume price_change p_change turnover
2015-03-02 12.25 12.67 12.52 12.20 96291.73 0.32 2.62 3.30
2015-03-03 12.52 13.06 12.70 12.52 139071.61 0.18 1.44 4.76
2015-03-04 12.80 12.92 12.90 12.61 67075.44 0.20 1.57 2.30
2015-03-05 12.88 13.45 13.16 12.87 93180.39 0.26 2.02 3.19
2015-03-06 13.17 14.48 14.28 13.13 179831.72 1.12 8.51 6.16
When sorting a Series there is only one column, so no by parameter is needed:
data['p_change'].sort_values(ascending=True).head()
2015-09-01 -10.03
2015-09-14 -10.02
2016-01-11 -10.02
2015-07-15 -10.02
2015-08-26 -10.01
Name: p_change, dtype: float64
Same as with a DataFrame:
# Sort by index
data['p_change'].sort_index().head()
2015-03-02 2.62
2015-03-03 1.44
2015-03-04 1.57
2015-03-05 2.02
2015-03-06 8.51
Name: p_change, dtype: float64
For example, add a number to an entire column:
data['open'].add(1).head()
# Equivalent: data['open'].head() + 1
2018-02-27 24.53
2018-02-26 23.80
2018-02-23 23.88
2018-02-22 23.25
2018-02-14 22.49
Subtract a number from an entire column, or subtract two Series element-wise:
close = data['close'].head()
open1 = data['open'].head()
data['my_price_change'] = close.sub(open1)  # close minus open gives the day's price change
data['my_price_change'].head()
# Equivalent: data['my_price_change'] = data['close'].head() - data['open'].head()
2018-02-27 0.63
2018-02-26 0.73
2018-02-23 -0.06
2018-02-22 0.03
2018-02-14 0.43
Name: my_price_change, dtype: float64
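A minimal sketch of both operations on made-up values:

```python
import pandas as pd

open_ = pd.Series([23.53, 22.80])
close = pd.Series([24.16, 23.53])

# add/sub are the method forms of + and -; they align values by index
plus_one = open_.add(1)    # same as open_ + 1
change = close.sub(open_)  # same as close - open_
print([round(v, 2) for v in plus_one.tolist()])  # [24.53, 23.8]
print(round(change[0], 2))                       # 0.63
```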
A Series supports comparison operators directly:
data['p_change'].head() > 2  # element-wise comparison yields a boolean Series
2018-02-27 True
2018-02-26 True
2018-02-23 True
2018-02-22 False
2018-02-14 True
Name: p_change, dtype: bool
Combine multiple conditions: filter rows where p_change > 2 and open > 15. Each comparison must be parenthesized, because & binds tighter than >:
data[(data['p_change'] > 2) & (data['open'] > 15)].head()
open high close low volume price_change p_change turnover my_price_change
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39 0.63
2018-02-26 22.80 23.78 23.53 22.80 60985.11 0.69 3.02 1.53 0.73
2018-02-23 22.88 23.37 22.82 22.71 52914.01 0.54 2.42 1.32 -0.06
2018-02-22 22.25 22.76 22.28 22.02 36105.01 0.36 1.64 0.90 0.03
2018-02-14 21.49 21.99 21.92 21.48 23331.04 0.44 2.05 0.58 0.43
query makes the same filter more concise:
data.query("p_change > 2 & open > 15").head()
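A minimal sketch of both filtering styles on made-up values, showing why the parentheses matter:

```python
import pandas as pd

df = pd.DataFrame({"p_change": [2.68, 1.64, 3.02],
                   "open": [23.53, 22.25, 14.00]})

# & binds tighter than >, so each comparison needs its own parentheses
mask = (df["p_change"] > 2) & (df["open"] > 15)
filtered = df[mask]

# query expresses the same condition as a string
via_query = df.query("p_change > 2 & open > 15")
print(filtered["p_change"].tolist())  # [2.68]
```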
You can also filter by membership in a set of values:
# e.g. keep rows whose 'turnover' is 4.19 or 2.39
data[data['turnover'].isin([4.19, 2.39])]
open high close low volume price_change p_change turnover my_price_change
2018-02-27 23.53 25.88 24.16 23.53 95578.03 0.63 2.68 2.39 0.63
2017-07-25 23.07 24.20 23.70 22.64 167489.48 0.67 2.91 4.19 0.63
2016-09-28 19.88 20.98 20.86 19.71 95580.75 0.98 4.93 2.39 0.98
2015-04-07 16.54 17.98 17.54 16.50 122471.85 0.88 5.28 4.19 1.00
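A minimal sketch of isin on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"turnover": [2.39, 1.53, 4.19, 0.90]})

# isin builds a boolean mask: True wherever the value is in the list
hits = df[df["turnover"].isin([4.19, 2.39])]
print(hits["turnover"].tolist())  # [2.39, 4.19]
```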
Summary statistics: many results such as count, mean, std, min, max can be obtained directly.
# Compute count, mean, std, min, quartiles and max per column
data.describe()
These were covered in detail in the NumPy section; here we demonstrate min, max, mean, median, var and std.
Function | Description |
---|---|
count | Number of non-NA observations |
sum | Sum of values |
mean | Mean of values |
median | Median of values |
min | Minimum |
max | Maximum |
mode | Mode |
abs | Absolute value |
prod | Product of values |
std | Standard deviation |
var | Variance |
idxmax | Index label of the maximum value |
idxmin | Index label of the minimum value |
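A minimal sketch of a few of these functions on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"open": [12.25, 12.52, 12.80],
                   "close": [12.52, 12.70, 12.90]})

# Statistics are computed per column by default (axis=0)
means = df.mean()
print(round(means["close"], 2))  # 12.71

# idxmax returns the row label where the maximum occurs
print(df["open"].idxmax())  # 2

# axis=1 computes per row instead
row_sums = df.sum(axis=1)
print(round(row_sums[0], 2))  # 24.77
```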
By default these statistics are computed per column (axis=0); pass axis=1 to compute per row.
# 0 computes a result per column, 1 a result per row
data.max(axis=0)
open 34.99
high 36.35
close 35.21
low 34.01
volume 501915.41
price_change 3.03
p_change 10.03
turnover 12.56
my_price_change 3.41
dtype: float64
# Variance
data.var(0)
open 1.545255e+01
high 1.662665e+01
close 1.554572e+01
low 1.437902e+01
volume 5.458124e+09
price_change 8.072595e-01
p_change 1.664394e+01
turnover 4.323800e+00
my_price_change 6.409037e-01
dtype: float64
# Standard deviation
data.std(0)
open 3.930973
high 4.077578
close 3.942806
low 3.791968
volume 73879.119354
price_change 0.898476
p_change 4.079698
turnover 2.079375
my_price_change 0.800565
dtype: float64
The median is the middle value when the data are sorted from smallest to largest. With an even number of values, it is the mean of the two middle values.
df = pd.DataFrame({'COL1': [2, 3, 4, 5, 4, 2],
                   'COL2': [0, 1, 2, 3, 4, 2]})
df.median()
COL1 3.5
COL2 2.0
dtype: float64
# Index label of the maximum value per column
data.idxmax(axis=0)
open 2015-06-15
high 2015-06-10
close 2015-06-12
low 2015-06-12
volume 2017-10-26
price_change 2015-06-09
p_change 2015-08-28
turnover 2017-10-26
my_price_change 2015-07-10
dtype: object
# Index label of the minimum value per column
data.idxmin(axis=0)
open 2015-03-02
high 2015-03-02
close 2015-09-02
low 2015-03-02
volume 2016-07-06
price_change 2015-06-15
p_change 2015-09-01
turnover 2016-07-06
my_price_change 2015-06-15
dtype: object
Function | Description |
---|---|
cumsum | Cumulative sum of the first 1/2/3/…/n values |
cummax | Cumulative maximum of the first 1/2/3/…/n values |
cummin | Cumulative minimum of the first 1/2/3/…/n values |
cumprod | Cumulative product of the first 1/2/3/…/n values |
These functions work on both Series and DataFrames.
Here we accumulate in chronological order, so sort by the date index first:
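A minimal sketch of the cumulative functions on made-up values:

```python
import pandas as pd

s = pd.Series([2.62, 1.44, 1.57])

# cumsum: running total; cummax: running maximum
print([round(v, 2) for v in s.cumsum().tolist()])  # [2.62, 4.06, 5.63]
print([round(v, 2) for v in s.cummax().tolist()])  # [2.62, 2.62, 2.62]
```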
data = data.sort_index()
After sorting, compute the running sum:
stock_rise = data['p_change']
# Cumulative sum of the daily percent changes
stock_rise.cumsum()
2015-03-02 2.62
2015-03-03 4.06
2015-03-04 5.63
2015-03-05 7.65
2015-03-06 16.16
To use the plot function, matplotlib must be imported:
import matplotlib.pyplot as plt
# plot draws the line chart
stock_rise.cumsum().plot()
# show() must be called to display the figure
plt.show()
apply runs a custom function along an axis; for example, the range (max minus min) of each column:
data[['open', 'close']].apply(lambda x: x.max() - x.min(), axis=0)
open 22.74
close 22.85
dtype: float64
DataFrame.plot(x=None, y=None, kind='line')
y: label, position, or list of labels/positions
kind: str — the type of plot to draw
More parameter details:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html?highlight=plot#pandas.DataFrame.plot
Calling series.plot() directly uses the index as the x axis and the values as the y axis.
More parameter details:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html?highlight=plot#pandas.Series.plot
pandas can read and write many file formats; here we briefly introduce the most common ones.
Read the stock data from before:
# Read the file, keeping only the 'open', 'high' and 'close' columns
data = pd.read_csv("./stock_day/stock_day.csv", usecols=['open', 'high', 'close'])
open high close
2018-02-27 23.53 25.88 24.16
2018-02-26 22.80 23.78 23.53
2018-02-23 22.88 23.37 22.82
2018-02-22 22.25 22.76 22.28
2018-02-14 21.49 21.99 21.92
DataFrame.to_csv(path_or_buf=None, sep=',', columns=None, header=True, index=True, index_label=None, mode='w', encoding=None)
Series.to_csv(path=None, index=True, sep=',', na_rep='', float_format=None, header=False, index_label=None, mode='w', encoding=None, compression=None, date_format=None, decimal='.')
Example:
- Save the 'open' column
# Save only 10 rows so the file is easy to inspect
data[:10].to_csv("./test.csv", columns=['open'])
pd.read_csv("./test.csv")
Unnamed: 0 open
0 2018-02-27 23.53
1 2018-02-26 22.80
2 2018-02-23 22.88
3 2018-02-22 22.25
4 2018-02-14 21.49
5 2018-02-13 21.40
6 2018-02-12 20.70
7 2018-02-09 21.20
8 2018-02-08 21.79
9 2018-02-07 22.69
Notice that the index was written into the file as a separate column. To avoid this, pass the index parameter, delete the old file, and save again.
# index=False: do not write the index values as a column
data[:10].to_csv("./test.csv", columns=['open'], index=False)
# Appending the same data again (mode='a') duplicates the column name:
data[:10].to_csv("./test.csv", columns=['open'], index=False, mode='a')
open
0 23.53
...
9 22.69
10 open
11 23.53
...
20 22.69
The column name was written a second time, so when appending you must also pass header=False:
data[:10].to_csv("./test.csv", columns=['open'], index=False, mode='a', header=False)
open
0 23.53
...
7 21.20
8 21.79
9 22.69
10 23.53
11 22.80
12 22.88
13 22.25
...
19 22.69
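The write/append behavior above can be sketched with an in-memory buffer instead of a file (made-up values):

```python
import io
import pandas as pd

df = pd.DataFrame({"open": [23.53, 22.80]}, index=["2018-02-27", "2018-02-26"])

buf = io.StringIO()
# First write: index=False keeps the date index out of the output
df.to_csv(buf, columns=["open"], index=False)
# Append: header=False prevents the column name being written again
df.to_csv(buf, columns=["open"], index=False, header=False)

buf.seek(0)
back = pd.read_csv(buf)
print(len(back))  # 4 data rows under a single 'open' header
```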
Reading and storing HDF5 files requires a key; the value is the DataFrame to store.
pandas.read_hdf(path_or_buf, key=None, **kwargs)
Reads data from an h5 file.
DataFrame.to_hdf(path_or_buf, key, **kwargs)
Example:
day_high = pd.read_hdf("./stock_data/day/day_high.h5")
day_high.to_hdf("./test.h5", key="day_high")
new_high = pd.read_hdf("./test.h5", key="day_high")
JSON is a common data-exchange format, used heavily in front-end/back-end interaction and sometimes chosen for storage, so we need to know how pandas reads and writes it.
Example:
Here we use a sarcastic news-headline dataset in JSON format. is_sarcastic: 1 if sarcastic, otherwise 0; headline: the title of the news article; article_link: link to the original article. The storage format is:
{"article_link": "https://www.huffingtonpost.com/entry/versace-black-code_us_5861fbefe4b0de3a08f600d5", "headline": "former versace store clerk sues over secret 'black code' for minority shoppers", "is_sarcastic": 0}
{"article_link": "https://www.huffingtonpost.com/entry/roseanne-revival-review_us_5ab3a497e4b054d118e04365", "headline": "the 'roseanne' revival catches up to our thorny political mood, for better and worse", "is_sarcastic": 0}
json_read = pd.read_json("Sarcasm_Headlines_Dataset.json", orient="records", lines=True)
The result is a DataFrame with the columns article_link, headline and is_sarcastic.
To write it back out:
json_read.to_json("./test.json", orient='records')
[{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0},{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1},{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0},{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/advancing-the-worlds-women_b_6810038.html","headline":"advancing the world's women","is_sarcastic":0},....]
Passing lines=True writes one JSON object per line instead:
json_read.to_json("./test.json", orient='records', lines=True)
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/versace-black-code_us_5861fbefe4b0de3a08f600d5","headline":"former versace store clerk sues over secret 'black code' for minority shoppers","is_sarcastic":0}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/roseanne-revival-review_us_5ab3a497e4b054d118e04365","headline":"the 'roseanne' revival catches up to our thorny political mood, for better and worse","is_sarcastic":0}
{"article_link":"https:\/\/local.theonion.com\/mom-starting-to-fear-son-s-web-series-closest-thing-she-1819576697","headline":"mom starting to fear son's web series closest thing she will have to grandchild","is_sarcastic":1}
{"article_link":"https:\/\/politics.theonion.com\/boehner-just-wants-wife-to-listen-not-come-up-with-alt-1819574302","headline":"boehner just wants wife to listen, not come up with alternative debt-reduction ideas","is_sarcastic":1}
{"article_link":"https:\/\/www.huffingtonpost.com\/entry\/jk-rowling-wishes-snape-happy-birthday_us_569117c4e4b0cad15e64fdcb","headline":"j.k. rowling wishes snape happy birthday in the most magical way","is_sarcastic":0}...
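The round trip can be sketched with an in-memory buffer and made-up records:

```python
import io
import pandas as pd

df = pd.DataFrame({"headline": ["first headline", "second headline"],
                   "is_sarcastic": [0, 1]})

# orient="records", lines=True writes one JSON object per line —
# exactly the shape read_json(..., lines=True) expects back
text = df.to_json(orient="records", lines=True)
back = pd.read_json(io.StringIO(text), orient="records", lines=True)
print(back["is_sarcastic"].tolist())  # [0, 1]
```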
When choosing a storage format, prefer HDF5.