Contents
Hierarchical indexing
Summary statistics by level
Using a DataFrame's columns
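Hierarchical indexing is an important pandas feature: it lets you have multiple index levels on a single axis, so you can work with higher-dimensional data in a lower-dimensional form.
levels holds, for each level of the index, the unique values that occur at that level.
labels holds, for each level, the integer position in levels of every index entry (pandas 0.24 and later call this attribute codes).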
>>> from pandas import DataFrame, Series
>>> import pandas as pd
>>> import numpy as np
>>> data = Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])
>>> data
a  1   -0.070153
   2    0.017225
   3    0.905866
b  1   -0.156584
   2    0.213097
   3    0.263765
c  1   -0.141315
   2    1.175804
d  2    0.812828
   3   -0.820116
dtype: float64
>>> data.index
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
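To make the relationship between the two attributes concrete, the sketch below rebuilds every index entry from them. It assumes a recent pandas release, where the attribute printed above as labels is exposed as codes; the variable names and the placeholder values are illustrative only.

```python
import numpy as np
import pandas as pd

# Same index layout as the example above; the values are placeholders.
outer = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd']
inner = [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]
data = pd.Series(np.arange(10.0), index=[outer, inner])

idx = data.index
# `codes` is the newer name of `labels` (assumes pandas >= 0.24).
outer_levels, inner_levels = idx.levels
outer_codes, inner_codes = idx.codes

# Each index entry is levels[code]: the code is a position in the level.
rebuilt = [(outer_levels[i], inner_levels[j])
           for i, j in zip(outer_codes, inner_codes)]
```

Here rebuilt holds the same (outer, inner) pairs as the index itself, which is why the index can store each distinct value once and refer to it by position.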
With a hierarchical index, selecting a subset of the data is straightforward, and you can also select on an inner level.
>>> data['b']
1 -0.156584
2 0.213097
3 0.263765
dtype: float64
>>> data['b':'c']
b  1   -0.156584
   2    0.213097
   3    0.263765
c  1   -0.141315
   2    1.175804
dtype: float64
>>> data.loc[['b','d']]
b  1   -0.156584
   2    0.213097
   3    0.263765
d  2    0.812828
   3   -0.820116
dtype: float64
>>> data[:,2]
a 0.017225
b 0.213097
c 1.175804
d 0.812828
dtype: float64
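The same selections can be written with label-based accessors. The following is a minimal sketch, assuming a current pandas release; the Series uses placeholder values rather than the random numbers above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

outer = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd']
inner = [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]
data = pd.Series(np.arange(10.0), index=[outer, inner])

by_outer = data.loc['b']            # every row under outer label 'b'
several = data.loc[['b', 'd']]      # a list of outer labels
by_inner = data.loc[:, 2]           # inner label 2 under every outer label
by_inner_xs = data.xs(2, level=1)   # the same cross-section via xs
```

xs names the level explicitly, which can read more clearly than a positional slice once an index has more than two levels.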
The unstack method rearranges the data into a new DataFrame; stack is the inverse operation and turns it back into a Series.
>>> data.unstack()
          1         2         3
a -0.070153  0.017225  0.905866
b -0.156584  0.213097  0.263765
c -0.141315  1.175804       NaN
d       NaN  0.812828 -0.820116
>>> data.unstack().stack()
a  1   -0.070153
   2    0.017225
   3    0.905866
b  1   -0.156584
   2    0.213097
   3    0.263765
c  1   -0.141315
   2    1.175804
d  2    0.812828
   3   -0.820116
dtype: float64
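unstack pivots the innermost level by default, but it also accepts a level argument. The sketch below, which uses placeholder values and illustrative names, pivots the outer level instead and then checks the round trip. The explicit dropna is an assumption made for portability: depending on the pandas version, stack may or may not discard the NaN cells that unstack introduced.

```python
import numpy as np
import pandas as pd

outer = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd']
inner = [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]
data = pd.Series(np.arange(10.0), index=[outer, inner])

# Pivot the outer level (level 0) into columns instead of the inner one.
wide_by_outer = data.unstack(level=0)

# Round trip: the missing combinations (c, 3) and (d, 1) appear as NaN
# after unstack; dropna removes them again if stack has kept them.
restored = data.unstack().stack().dropna()
```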
Each index level can be given a name, on the rows as well as on the columns.
>>> frame = DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Onio','Onio','Colorado'],['Green','Red','Green']])
>>> frame
     Onio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
>>> frame.stack()
           Colorado  Onio
a 1 Green       2.0     0
    Red         NaN     1
  2 Green       5.0     3
    Red         NaN     4
b 1 Green       8.0     6
    Red         NaN     7
  2 Green      11.0     9
    Red         NaN    10
>>> frame.index.names=['key1','key2']
>>> frame.columns.names = ['state','color']
>>> frame
state      Onio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
>>> frame['Onio']
     Green  Red
a 1      0    1
  2      3    4
b 1      6    7
  2      9   10
>>> frame.swaplevel('key1','key2')
state      Onio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
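swaplevel only exchanges the two levels; the rows keep their original order, as the output above shows. A common follow-up is to sort on the new outer level. The sketch below rebuilds the frame from above and does that; the names swapped and ordered are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Onio', 'Onio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Exchange the two row levels; the data itself is not reordered.
swapped = frame.swaplevel('key1', 'key2')

# Sort on the new outer level (key2) so equal keys sit together.
ordered = swapped.sort_index(level=0)
```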
>>> frame.groupby(level='key2').sum()
state  Onio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
>>> frame.T.groupby(level='color').sum().T
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
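Aggregating by level is a group-by on one index level. As a hedged sketch, the code below rebuilds the frame and computes the two sets of totals with groupby(level=...), a spelling that runs on current pandas releases (the level argument of sum was removed in pandas 2.0), and cross-checks one row against a manual cross-section. Transposing before and after the column-wise case is one possible approach, chosen here to avoid the axis argument of groupby; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Onio', 'Onio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Manual check for key2 == 1: take that cross-section and sum its rows.
manual_key2_1 = frame.xs(1, level='key2').sum()

# Sum columns that share the same color: transpose, group, transpose back.
by_color = frame.T.groupby(level='color').sum().T
```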
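DataFrame's set_index method converts one or more columns into the row index and returns a new DataFrame.
By default those columns are removed from the DataFrame, but passing drop=False keeps them. reset_index does the opposite: it moves the index levels back into columns.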
>>> frame = DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one','two','two','two','two'],'d':[0,1,2,0,1,2,3]})
>>> frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
>>> frame2 = frame.set_index(['c','d'])
>>> frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
>>> frame.set_index(['c','d'],drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
>>> frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
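set_index and reset_index are inverses apart from column order: the former index levels come back as the leading columns. The following minimal sketch, with illustrative variable names, shows the round trip and the reordering needed to recover the original layout.

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3],
})

# Move columns c and d into a two-level row index.
indexed = frame.set_index(['c', 'd'])

# Move them back; they now come first, ahead of a and b.
restored = indexed.reset_index()

# Select the columns in their original order to recover the layout.
reordered = restored[frame.columns.tolist()]
```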