A "group operation" is really a combination of several steps, which can be broken down as "split-apply-combine", a phrase that describes the whole process well. In the first stage, the data in a pandas object (whether a Series, a DataFrame, or something else) is split into groups according to one or more "keys" that you provide. The split is performed on a particular axis of the object: a DataFrame, for example, can be grouped on its rows (axis=0) or its columns (axis=1). Next, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into the final result object, whose form generally depends on what was done to the data.
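The three stages can be sketched by hand on a small made-up frame (frame, groups, applied, and result are illustrative names, not pandas API):

```python
import pandas as pd

frame = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1, 2, 30]})

# Split: partition the rows by the key column
groups = {k: g for k, g in frame.groupby('key')}

# Apply: run an aggregation (here, sum) on each piece
applied = {k: g['val'].sum() for k, g in groups.items()}

# Combine: assemble the per-group results into a new object
result = pd.Series(applied)
print(result)
```

This produces the same values as the one-liner `frame.groupby('key')['val'].sum()`, which performs all three stages internally.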
Note: the function applied is often an aggregation, e.g. sum, mean, min, or max, though it need not be.
The figure below illustrates the grouped-aggregation process:
Group keys can take several forms, and the types need not match: a list or array of values the same length as the axis being grouped; a string naming a column of the DataFrame; a dict or Series giving a mapping from axis values to group names; or a function to be invoked on each axis value. Each form is demonstrated below.
If this all seems abstract, don't worry: plenty of examples follow. First, consider this very simple tabular dataset (given as a DataFrame):
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
print(df)
Output:
key1 key2 data1 data2
0 a one 0.015108 0.304983
1 a two 2.054185 -0.009759
2 b one -1.057348 -1.703048
3 b two -3.696947 -0.788548
4 a one 1.452735 0.388301
grouped = df['data1'].groupby(df['key1'])
print(grouped)
Output: grouped is a GroupBy object; printing it shows only a repr along the lines of <pandas.core.groupby.SeriesGroupBy object at 0x...>. Nothing has been computed yet, but the object has all the information it needs to apply an operation to each group. For example, to compute the group means:
get_mean = grouped.mean()
print(get_mean)
Output:
key1
a 1.174009
b -2.377148
Name: data1, dtype: float64
What matters here is that calling .mean() aggregated the data (a Series) according to the group key, producing a new Series whose index consists of the unique values in the key1 column. The result's index is named key1 because the original DataFrame column df['key1'] has that name.

If we instead pass a list of arrays, the data is grouped by two keys:

means = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(means)
Output:
key1 key2
a one 0.733921
two 2.054185
b one -1.057348
two -3.696947
Name: data1, dtype: float64
The result now has a hierarchical index consisting of the unique key pairs; calling unstack pivots the inner level into columns:

print(means.unstack())
Output:
key2 one two
key1
a 0.733921 2.054185
b -1.057348 -3.696947
In these examples the group keys are Series, but they could be any arrays of the right length:

states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
print(df['data1'].groupby([states, years]).mean())

Output:
California  2005    2.054185
            2006   -1.057348
Ohio        2005   -1.840920
            2006    1.452735
Name: data1, dtype: float64
Frequently, the grouping information is found in the same DataFrame as the data you want to work on; in that case you can pass column names as the group keys:

print(df.groupby(['key1']).mean())

Output (the non-numeric key2 column is a "nuisance column" and is excluded from the result; in recent pandas you may need mean(numeric_only=True) to get this behaviour):
data1 data2
key1
a 1.174009 0.227842
b -2.377148 -1.245798
print(df.groupby(['key1','key2']).mean())
Output:
data1 data2
key1 key2
a one 0.733921 0.346642
two 2.054185 -0.009759
b one -1.057348 -1.703048
two -3.696947 -0.788548
A generally useful GroupBy method is size, which returns a Series containing the group sizes:

print(df.groupby(['key1','key2']).size())

Output:
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
Note: so far, any missing values in a group key have been excluded from the result; newer versions of pandas provide options for handling them.
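For instance, pandas 1.1 and later expose a dropna flag on groupby; a small sketch with a made-up frame:

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame({'key': ['a', np.nan, 'a'], 'val': [1, 2, 3]})

# Default behaviour: the row whose key is NaN is silently dropped
print(frame.groupby('key')['val'].sum())

# pandas 1.1+ accepts dropna=False to keep NaN as its own group
print(frame.groupby('key', dropna=False)['val'].sum())
```

With dropna=False, the NaN group appears last in the sorted result.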
The GroupBy object supports iteration, producing a sequence of 2-tuples containing the group name and the chunk of data. Consider the following simple example:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
Output:
a
key1 key2 data1 data2
0 a one 0.015108 0.304983
1 a two 2.054185 -0.009759
4 a one 1.452735 0.388301
b
key1 key2 data1 data2
2 b one -1.057348 -1.703048
3 b two -3.696947 -0.788548
In the case of multiple keys, the first element of the tuple is itself a tuple of key values:

for (k1, k2), group in df.groupby(['key1', 'key2']):
    print(k1, k2)
    print(group)
Output:
a one
key1 key2 data1 data2
0 a one 0.015108 0.304983
4 a one 1.452735 0.388301
a two
key1 key2 data1 data2
1 a two 2.054185 -0.009759
b one
key1 key2 data1 data2
2 b one -1.057348 -1.703048
b two
key1 key2 data1 data2
3 b two -3.696947 -0.788548
A useful recipe is to compute a dict of the data pieces in one line:

pieces = dict(list(df.groupby('key1')))
print("pieces:\n", pieces)
print("pieces['b']:\n", pieces['b'])

Output:
pieces:
{'a': key1 key2 data1 data2
0 a one 0.015108 0.304983
1 a two 2.054185 -0.009759
4 a one 1.452735 0.388301, 'b': key1 key2 data1 data2
2 b one -1.057348 -1.703048
3 b two -3.696947 -0.788548}
pieces['b']:
key1 key2 data1 data2
2 b one -1.057348 -1.703048
3 b two -3.696947 -0.788548
By default, groupby groups on axis=0, but you can group on any of the other axes. For example, here we group the columns of df by dtype (note that recent pandas deprecates axis=1 in groupby; grouping the transpose on axis=0 achieves the same):

grouped = df.groupby(df.dtypes, axis=1)
print(dict(list(grouped)))

Output:
{dtype('float64'): data1 data2
0 0.015108 0.304983
1 2.054185 -0.009759
2 -1.057348 -1.703048
3 -3.696947 -0.788548
4 1.452735 0.388301,
dtype('O'): key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one}
For a GroupBy object created from a DataFrame, indexing it with a column name (a single string) or an array of column names has the effect of selecting those columns for aggregation. In other words:
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
is syntactic sugar for:
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
Especially for large datasets, it may be desirable to aggregate only a few columns. For example, to compute the mean of just the data2 column, with the result as a DataFrame:

s = df.groupby(['key1', 'key2'])[['data2']].mean()
print(s)

Output:
data2
key1 key2
a one 0.346642
two -0.009759
b one -1.703048
two -0.788548
Indexing with a single column name (rather than a list of names) returns a grouped Series instead:

s_grouped = df.groupby(['key1', 'key2'])['data2']
print(s_grouped)   # prints only the SeriesGroupBy object's repr
print(s_grouped.mean())

Output (of the .mean() call):
key1 key2
a one 0.346642
two -0.009759
b one -1.703048
two -0.788548
Name: data2, dtype: float64
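One can sanity-check the syntactic-sugar equivalence described above on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'key1': ['a', 'a', 'b'],
                      'data1': [1.0, 2.0, 3.0]})

# Indexing the GroupBy object by column name...
sugar = frame.groupby('key1')['data1'].mean()

# ...matches grouping the selected column directly
longhand = frame['data1'].groupby(frame['key1']).mean()

print(sugar.equals(longhand))
```

Both forms produce the same Series, so the choice is purely a matter of style.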
Grouping information can also exist in forms other than arrays. Consider another example DataFrame:
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.loc['Wes', ['b', 'c']] = np.nan  # add a couple of NaN values (.ix is removed in modern pandas)
print(people)
"""
a b c d e
Joe 0.485915 -1.271190 0.090832 -0.237905 -1.414645
Steve 0.659409 -0.406590 -0.985230 0.429787 1.351408
Wes -1.043782 NaN NaN 0.379168 0.095054
Jim 0.059189 -0.966218 -1.253383 1.774299 1.461221
Travis 1.702478 0.331087 0.568426 -0.985880 0.586774
"""
# Define a dict mapping column names to group names
# (unused keys such as 'f' are fine)
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
# Group the columns according to the mapping
by_column = people.groupby(mapping, axis=1)
# Sum within each group of columns
print(by_column.sum())
"""
blue red
Joe -0.147073 -2.199920
Steve -0.555443 1.604226
Wes 0.379168 -0.948727
Jim 0.520916 0.554192
Travis -0.417454 2.620338
"""
The same functionality works with a Series, which can be viewed as a fixed-size mapping:

map_series = pd.Series(mapping)
by_series = people.groupby(map_series, axis=1).count()
print(by_series)
"""
blue red
Joe 2 3
Steve 2 3
Wes 1 2
Jim 2 3
Travis 2 3
"""
Compared with a dict or a Series, a Python function is a more creative and abstract way to define a group mapping. Any function passed as a group key is called once per index value, and its return value is used as the group name. Concretely, take the example DataFrame from the previous section, whose index values are people's first names. Suppose you want to group by name length: you could compute an array of string lengths, but it is simpler to just pass the len function:
s = people.groupby(len).sum()
print(s)
"""
a b c d e
3 -0.498678 -2.237408 -1.162551 1.915562 0.141630
5 0.659409 -0.406590 -0.985230 0.429787 1.351408
6 1.702478 0.331087 0.568426 -0.985880 0.586774
"""
Mixing functions with arrays, dicts, or Series is not a problem; everything is converted to arrays internally:

key_list = ['one', 'one', 'one', 'two', 'two']
s = people.groupby([len, key_list]).min()
print(s)
"""
a b c d e
3 one -1.043782 -1.271190 0.090832 -0.237905 -1.414645
two 0.059189 -0.966218 -1.253383 1.774299 1.461221
5 one 0.659409 -0.406590 -0.985230 0.429787 1.351408
6 two 1.702478 0.331087 0.568426 -0.985880 0.586774
"""
A convenience of hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. To do so, pass the level number or name via the level keyword:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['city', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
print(hier_df)
"""
city US JP
tenor 1 3 5 1 3
0 0.700936 1.110087 -1.058324 1.316239 -0.940866
1 0.193931 1.845167 0.191146 -1.081856 -1.396286
2 -0.533317 0.021661 1.216783 -0.777967 -1.105844
3 0.759767 -0.599406 -1.145386 -0.675289 0.465727
"""
s = hier_df.groupby(level='city', axis=1).count()
print(s)
"""
city JP US
0 2 3
1 2 3
2 2 3
3 2 3
"""
That covers basic GroupBy usage. The heart of it is the "key": this article has demonstrated several ways to define group keys, and you can choose whichever fits your particular needs.