data
city US JP
tenor 1 3 5 1 3
0 -0.423185 0.107952 0.051819 -3.058473 -0.648571
1 0.011324 -1.096422 -0.605934 -0.049505 -0.440209
2 -0.452174 0.876530 1.895581 0.585658 -1.038866
3 -0.087396 -0.666303 0.917700 -0.560344 -0.550330
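grouped = data.groupby(level='tenor', axis=1)  # 'tenor' is a level of the column MultiIndex, so group along axis=1
grouped.groups: returns the Index labels belonging to each group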
grouped.groups
{1: MultiIndex(levels=[['JP', 'US'], [1, 3, 5]],
codes=[[1, 0], [0, 0]],
names=['city', 'tenor']),
3: MultiIndex(levels=[['JP', 'US'], [1, 3, 5]],
codes=[[1, 0], [1, 1]],
names=['city', 'tenor']),
5: MultiIndex(levels=[['JP', 'US'], [1, 3, 5]],
codes=[[1], [2]],
names=['city', 'tenor'])}
dict(list(grouped))[1] or grouped.get_group(1): returns the data block of a single group
grouped.get_group(1)
city US JP
tenor 1 1
0 -0.423185 -3.058473
1 0.011324 -0.049505
2 -0.452174 0.585658
3 -0.087396 -0.560344
list(grouped)
[(1, city US JP
tenor 1 1
0 -0.423185 -3.058473
1 0.011324 -0.049505
2 -0.452174 0.585658
3 -0.087396 -0.560344), (3, city US JP
tenor 3 3
0 0.107952 -0.648571
1 -1.096422 -0.440209
2 0.876530 -1.038866
3 -0.666303 -0.550330), (5, city US
tenor 5
0 0.051819
1 -0.605934
2 1.895581
3 0.917700)]
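Grouping keys:
① A list or array of values with the same length as the axis being grouped
② A value naming a column of the DataFrame
③ A dict or Series that maps the values on the grouping axis to group names
④ A function that is called on the axis index or on each individual label in the index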
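The four kinds of grouping key listed above can be compared side by side. This is only a minimal sketch: the frame `df`, its column names, and the `label_to_team` mapping are made up for illustration and do not come from the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; names and values are illustrative only.
df = pd.DataFrame(
    {"key1": ["a", "a", "b", "b"], "val": [1, 2, 3, 4]},
    index=["Joe", "Steve", "Wes", "Jim"],
)

# (1) an array with the same length as the grouped axis
by_array = df.groupby(np.array(["x", "y", "x", "y"]))["val"].sum()

# (2) a column name of the DataFrame
by_column = df.groupby("key1")["val"].sum()

# (3) a dict mapping index labels to group names
label_to_team = {"Joe": "t1", "Steve": "t1", "Wes": "t2", "Jim": "t2"}
by_dict = df.groupby(label_to_team)["val"].sum()

# (4) a function called once per index label (here: length of the name)
by_func = df.groupby(len)["val"].sum()

print(by_array, by_column, by_dict, by_func, sep="\n\n")
```

All four calls split the same four rows; only the source of the group labels differs.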
data = pd.DataFrame(np.random.randint(50,100,(5,4)))
stats = np.array(['A','B','B','A','A'])
years = np.array([2005,2005,2006,2005,2006])
data.groupby([stats,years])[1].mean()
A 2005 68.5
2006 54.0
B 2005 62.0
2006 65.0
Name: 1, dtype: float64
data.groupby(['key1','key2']).mean()  # group by column names (assumes data has columns key1 and key2)
data = pd.DataFrame(np.random.randn(5,5),columns=['a','b','c','d','e'],
index = ['Joe','Steve','Wes','Jim','Travis'])
data
a b c d e
Joe -0.257648 -2.155991 -0.335562 0.082991 0.175928
Steve 0.619238 -1.261862 -0.628191 -0.358416 1.025076
Wes -0.854484 2.454894 1.638307 0.711922 -0.265363
Jim -0.044105 0.072261 -1.562865 -1.979348 1.269205
Travis -0.318734 1.170129 -0.371371 -0.705293 0.831156
mapping = {'a':'red','b':'red','e':'red','c':'blue','d':'blue','f':'orange'}
grouped = data.groupby(mapping,axis=1) # group the columns using a dict
grouped.sum()
blue red
Joe -0.252571 -2.237711
Steve -0.986607 0.382451
Wes 2.350229 1.335047
Jim -3.542213 1.297361
Travis -1.076664 1.682550
map_series = pd.Series(mapping)
map_series
a red
b red
e red
c blue
d blue
f orange
dtype: object
data.groupby(map_series,axis=1).sum()
blue red
Joe -0.252571 -2.237711
Steve -0.986607 0.382451
Wes 2.350229 1.335047
Jim -3.542213 1.297361
Travis -1.076664 1.682550
A function passed as a grouping key is called once per index value, and its return values are used as the group names
data.groupby(len).sum()
a b c d e
3 -1.156237 0.371164 -0.260120 -1.184435 1.179771
5 0.619238 -1.261862 -0.628191 -0.358416 1.025076
6 -0.318734 1.170129 -0.371371 -0.705293 0.831156
key_list = ['one','one','one','two','two']
data.groupby([len,key_list]).min()
a b c d e
3 one -0.854484 -2.155991 -0.335562 0.082991 -0.265363
two -0.044105 0.072261 -1.562865 -1.979348 1.269205
5 one 0.619238 -1.261862 -0.628191 -0.358416 1.025076
6 two -0.318734 1.170129 -0.371371 -0.705293 0.831156
columns = pd.MultiIndex.from_arrays([['US','US','US','JP','JP'],[1,3,5,1,3]],names=['city','tenor'])
data = pd.DataFrame(np.random.randn(4,5),columns=columns)
city US JP
tenor 1 3 5 1 3
0 -0.423185 0.107952 0.051819 -3.058473 -0.648571
1 0.011324 -1.096422 -0.605934 -0.049505 -0.440209
2 -0.452174 0.876530 1.895581 0.585658 -1.038866
3 -0.087396 -0.666303 0.917700 -0.560344 -0.550330
data.groupby(level='tenor',axis=1).sum()
tenor 1 3 5
0 -3.481657 -0.540619 0.051819
1 -0.038181 -1.536631 -0.605934
2 0.133484 -0.162336 1.895581
3 -0.647739 -1.216634 0.917700
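In recent pandas releases (2.1 and later, as far as I know) passing axis=1 to groupby emits a deprecation warning. One possible workaround, sketched here under that assumption, is to transpose, group the rows by level, and transpose back. The name `frame` and its deterministic values are illustrative; the column layout mirrors the city/tenor example.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_arrays(
    [["US", "US", "US", "JP", "JP"], [1, 3, 5, 1, 3]],
    names=["city", "tenor"],
)
# Deterministic values (0..19) instead of random ones, so the sums are easy to check.
frame = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=columns)

# Transpose so the column levels become row levels, group by 'tenor', transpose back.
by_tenor = frame.T.groupby(level="tenor").sum().T

print(by_tenor)
# The first row of frame is 0, 1, 2, 3, 4, so tenor 1 -> 0 + 3, tenor 3 -> 1 + 4, tenor 5 -> 2.
```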
GroupBy objects support iteration, yielding a sequence of 2-tuples that each contain the group name and the data block
for name, values in data.groupby('key1'):
    print(name)
    print(values)
for (k1, k2), values in data.groupby(['key1', 'key2']):
    print((k1, k2))
    print(values)
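A runnable version of the iteration pattern, as a minimal sketch. The frame `df` and its columns key1, key2, and data1 are hypothetical stand-ins for a dataset with two key columns.

```python
import pandas as pd

# Hypothetical frame with two key columns; values are illustrative only.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": [1, 2, 3, 4, 5],
})

# Single key: each item is (group name, sub-DataFrame).
single = {name: chunk["data1"].tolist() for name, chunk in df.groupby("key1")}

# Multiple keys: the group name is a tuple of key values.
pairs = {
    (k1, k2): int(chunk["data1"].sum())
    for (k1, k2), chunk in df.groupby(["key1", "key2"])
}

print(single)  # {'a': [1, 2, 5], 'b': [3, 4]}
print(pairs)   # {('a', 'one'): 6, ('a', 'two'): 2, ('b', 'one'): 3, ('b', 'two'): 4}
```

Collecting the pieces into a dict like this is the same idea as dict(list(grouped)).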
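count        Number of non-NaN values in the group
sum          Sum of non-NaN values
mean         Mean of non-NaN values
median       Arithmetic median of non-NaN values
std, var     Unbiased (n-1 denominator) standard deviation and variance
min, max     Minimum and maximum of non-NaN values
prod         Product of non-NaN values
first, last  First and last non-NaN values
size         Returns a Series containing the size of each group
describe     Returns summary statistics for each group
quantile     Returns the sample quantile for each group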
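To see how these methods treat missing values, here is a small sketch on made-up data. The frame `df` and its columns are hypothetical; the point is that count skips NaN while size does not, and that std uses the n-1 denominator.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'a' contains one NaN.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b"],
    "x": [1.0, np.nan, 3.0, 4.0, 6.0],
})
g = df.groupby("key")["x"]

counts = g.count()  # non-NaN values per group: a -> 2, b -> 2
sizes = g.size()    # rows per group, NaN included: a -> 3, b -> 2
means = g.mean()    # NaN is skipped: a -> 2.0, b -> 5.0
stds = g.std()      # n-1 denominator: both groups give sqrt(2)
firsts = g.first()  # first non-NaN value per group: a -> 1.0, b -> 4.0

print(pd.DataFrame({"count": counts, "size": sizes, "mean": means, "std": stds}))
```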
If you pass a function or a list of functions, you get a DataFrame whose column names are the names of those functions
grouped.agg(['min','max','mean','std'])
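If you do not want the column names that the GroupBy object assigns, pass a list of (name, function) tuples; each column is then named name
grouped.agg([('name1','mean'),('name2','min')])
Pass a dict mapping column names to function names
grouped.agg({'key1':['mean','std','sum'],'key2':['min','max']})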
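The three ways of calling agg can be tried on a tiny frame. This is a sketch only: `df`, its columns data1 and data2, and the labels "average" and "smallest" are invented for illustration.

```python
import pandas as pd

# Hypothetical frame; column names and values are illustrative only.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "data1": [1.0, 3.0, 5.0, 9.0],
    "data2": [10.0, 20.0, 30.0, 50.0],
})
grouped = df.groupby("key")

# List of function names: one result column per function.
multi = grouped["data1"].agg(["min", "max", "mean"])

# List of (name, function) tuples: result columns take the given names.
renamed = grouped["data1"].agg([("average", "mean"), ("smallest", "min")])

# Dict of column -> functions: different aggregations per column,
# producing hierarchical (column, function) column labels.
per_column = grouped.agg({"data1": ["mean", "sum"], "data2": ["max"]})

print(multi, renamed, per_column, sep="\n\n")
```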
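data.groupby(['key1','key2'],as_index=False) # keep the original layout: group keys stay as ordinary columns instead of becoming the index
Pass group_keys=False to drop the hierarchical index formed from the group keys, so the result keeps only the index of each original piece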
data.groupby(['key1','key2'],group_keys=False)
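The effect of as_index and group_keys is easiest to see on the resulting index. A minimal sketch follows; the frame `df` and the helper `first_row` are hypothetical. The value column is selected explicitly before apply so that the helper never touches the key column.

```python
import pandas as pd

# Hypothetical frame; names and values are illustrative only.
df = pd.DataFrame({
    "key1": ["a", "a", "b"],
    "key2": ["one", "two", "one"],
    "data1": [1.0, 2.0, 3.0],
})

# Default: the group keys become a hierarchical row index.
indexed = df.groupby(["key1", "key2"]).mean()

# as_index=False: the keys stay as ordinary columns with a default integer index.
flat = df.groupby(["key1", "key2"], as_index=False).mean()

def first_row(chunk):
    """Hypothetical helper: keep only the first row of each group."""
    return chunk.head(1)

# With apply, the group key is prepended to each piece's original index by default...
with_keys = df.groupby("key1")[["data1"]].apply(first_row)
# ...and group_keys=False leaves only the original index.
without_keys = df.groupby("key1", group_keys=False)[["data1"]].apply(first_row)

print(indexed, flat, with_keys, without_keys, sep="\n\n")
```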
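def top(df, n=5, columns='key1'):
    return df.sort_values(by=columns)[-n:]
# top is called on each group of rows of the DataFrame; the results are then
# glued together with pandas.concat, using the group names as labels for each piece.
# The ? marks are placeholders for the values you choose.
data.groupby('key').apply(top, n=?, columns='?')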
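A runnable variant of the top-n pattern, as a sketch under assumed data. The frame `df`, the "score" column, and the parameter name `column` are hypothetical; extra keyword arguments given to apply are forwarded to the function.

```python
import pandas as pd

# Hypothetical frame; names and values are illustrative only.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b"],
    "score": [3, 1, 2, 5, 4],
})

def top(chunk, n=2, column="score"):
    """Return the n rows of chunk with the largest values in column."""
    return chunk.sort_values(by=column)[-n:]

# Keyword arguments after the function are passed through to top.
result = df.groupby("key")[["score"]].apply(top, n=2, column="score")

print(result)
# The outer index level is the group name, the inner level is the original row label.
```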