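How groupby groups data:
(1) The grouping key can be an array, list, dict, Series, or function. Arrays and lists must have the same length as the axis being grouped, a Series is aligned on its index, and a dict or function is applied to each index label to produce that element's group label.
(2) Grouping can run over rows or columns: axis=0 groups rows and axis=1 groups columns. Rows are the default. (The axis=1 option is deprecated since pandas 2.1.)
(3) A function is then applied to each group. It can be a built-in aggregation or a user-defined function.
(4) The per-group results are combined into a single output.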
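As a minimal sketch of point (1), the snippet below groups the same Series with a list, a dict, and a function. The Series s and its values are hypothetical, chosen only so the sums are easy to check by hand.

```python
import pandas as pd

# Hypothetical data, not taken from the examples in this article.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['joe', 'steve', 'wes', 'jim'])

# List key: one group label per element, so its length must equal len(s).
by_list = s.groupby(['x', 'y', 'x', 'y']).sum()

# Dict key: maps each index label to a group label.
by_dict = s.groupby({'joe': 'x', 'steve': 'y', 'wes': 'x', 'jim': 'y'}).sum()

# Function key: called once per index label; here it groups by name length.
by_func = s.groupby(len).sum()

print(by_list)   # x -> 4.0, y -> 6.0
print(by_dict)   # same groups as by_list
print(by_func)   # 3 -> 8.0 (joe, wes, jim), 5 -> 2.0 (steve)
```

All three calls follow the same split, apply, combine sequence; only the way the group labels are derived differs.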
The following examples illustrate groupby in detail.
Create an example DataFrame:
import numpy as np
import pandas as pd

def GroupbyDemo():
    # Two string key columns and two columns of random data
    df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                       'key2': ['one', 'two', 'one', 'two', 'one'],
                       'data1': np.random.randn(5),
                       'data2': np.random.randn(5)})
    print(df)

if __name__ == '__main__':
    GroupbyDemo()
Printed output:
key1 key2 data1 data2
0 a one 0.921248 1.090957
1 a two 0.211169 -1.826231
2 b one 0.058034 0.978667
3 b two 0.163153 0.835136
4 a one -0.231977 0.645021
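(1) Group data1 by key1, then compute the mean of each group:
grouped = df['data1'].groupby(df['key1']).mean()
The result (the data is random, so these values come from a different run than the table above):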
key1
a 0.924545
b -0.148181
grouped = df['data1'].groupby(df['key1'])
for i in grouped:
    print(i)
This prints every group. Each item produced by the iteration is a tuple of (group key, data in that group).
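Because each item is a (key, group) tuple, it can be unpacked directly in the for statement. The sketch below assumes a small DataFrame with fixed values instead of random ones so that the output is reproducible; the sizes dict is only there for illustration.

```python
import pandas as pd

# Fixed values (an assumption for reproducibility) instead of np.random.randn.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

grouped = df['data1'].groupby(df['key1'])

sizes = {}
for name, group in grouped:
    # name is the group key, group is the sub-Series belonging to that key
    sizes[name] = len(group)
    print(name, group.tolist())
# a [1.0, 2.0, 5.0]
# b [3.0, 4.0]
```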
(2) Group data1 by both key1 and key2, then compute the mean:
grouped = df['data1'].groupby([df['key1'], df['key2']]).mean()
The result (again from a separate random run):
key1 key2
a one -0.276938
two 1.882745
b one -0.679474
two -0.269018
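Grouping by two keys produces a Series with a two-level index. As a hedged sketch with fixed, hypothetical values, the same grouping can also be written with column names, and unstack() pivots the inner level into columns:

```python
import pandas as pd

# Fixed values (an assumption) so the means can be verified by hand.
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Passing column names is equivalent to passing the Series themselves.
means = df.groupby(['key1', 'key2'])['data1'].mean()
print(means)

# Move the key2 level into columns to get a key1 x key2 table.
table = means.unstack()
print(table)
```

Here the group (a, one) contains the values 1.0 and 5.0, so its mean is 3.0.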
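The groupings above all operate on rows. Grouping by columns works as follows.
Create a DataFrame with numeric key columns and a labeled index:
import numpy as np
import pandas as pd

def GroupbyDemo():
    # Numeric key columns plus two columns of random data, indexed by name
    df = pd.DataFrame({'key1': [1, 2, 3, 4, 5],
                       'key2': [10, 20, 30, 40, 50],
                       'data1': np.random.randn(5),
                       'data2': np.random.randn(5)},
                      index=['joe', 'steve', 'wes', 'jim', 'travis'])
    print(df)

if __name__ == '__main__':
    GroupbyDemo()
Printed output: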
key1 key2 data1 data2
joe 1 10 1.467131 0.760701
steve 2 20 1.631652 1.518505
wes 3 30 -0.058462 -0.244320
jim 4 40 -0.595540 -2.083987
travis 5 50 -0.587168 0.795081
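(1) Group by columns:
# Map each column name to a group label
groupBy = {'key1': 'red', 'key2': 'red', 'data1': 'blue',
           'data2': 'blue'}
# axis=1 is deprecated since pandas 2.1; newer versions need
# df.T.groupby(groupBy).mean().T instead
grouped = df.groupby(groupBy, axis=1).mean()
print(grouped)
Printed output (the blue column comes from a separate random run):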
blue red
joe 0.016355 5.5
steve 0.379583 11.0
wes 0.474951 16.5
jim 0.692162 22.0
travis -1.670801 27.5
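Since the axis argument of groupby is deprecated in recent pandas releases, one way to get the same column-wise grouping is to transpose, group the rows, and transpose back. This is a sketch with fixed, hypothetical values for data1 and data2, not the author's original code:

```python
import pandas as pd

# data1 and data2 use fixed values (an assumption) instead of random numbers.
df = pd.DataFrame({'key1': [1, 2, 3, 4, 5],
                   'key2': [10, 20, 30, 40, 50],
                   'data1': [0.0, 1.0, 2.0, 3.0, 4.0],
                   'data2': [2.0, 3.0, 4.0, 5.0, 6.0]},
                  index=['joe', 'steve', 'wes', 'jim', 'travis'])

mapping = {'key1': 'red', 'key2': 'red', 'data1': 'blue', 'data2': 'blue'}

# After the transpose the column names are index labels, so the dict
# is applied to them; the final .T restores the original orientation.
grouped = df.T.groupby(mapping).mean().T
print(grouped)
```

For joe, red is the mean of 1 and 10 (5.5) and blue is the mean of 0.0 and 2.0 (1.0).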
Compute a per-group value with a custom function:
import numpy as np
import pandas as pd

def peak_peak(arr):
    # Range of the group: largest value minus smallest value
    return arr.max() - arr.min()

def GroupbyDemo():
    df = pd.DataFrame({'key1': [1, 2, 1, 2, 1],
                       'key2': [10, 20, 30, 40, 50],
                       'data1': np.random.randn(5),
                       'data2': np.random.randn(5)},
                      index=['joe', 'steve', 'wes', 'jim', 'travis'])
    print(df)
    # agg calls peak_peak once per group of data1
    grouped = df['data1'].groupby(df['key1']).agg(peak_peak)
    print("#################################################")
    print(grouped)

if __name__ == '__main__':
    GroupbyDemo()
Printed output:
key1 key2 data1 data2
joe 1 10 -1.064144 -1.419688
steve 2 20 -0.191633 -0.254214
wes 1 30 0.911625 -1.258709
jim 2 40 0.100250 0.445733
travis 1 50 -0.980806 1.710197
#################################################
key1
1 1.975770
2 0.291883
Name: data1, dtype: float64
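agg also accepts a list, so a custom function can be combined with built-in aggregations in one call. The sketch below reuses peak_peak with fixed, hypothetical data; the result has one column per function, named after the function.

```python
import pandas as pd

def peak_peak(arr):
    # Range of the group: largest value minus smallest value
    return arr.max() - arr.min()

# Fixed values (an assumption) so the results can be checked by hand.
df = pd.DataFrame({'key1': [1, 2, 1, 2, 1],
                   'data1': [1.0, 4.0, 3.0, 9.0, 7.0]})

# 'mean' is a built-in aggregation name; peak_peak is the custom function.
result = df.groupby('key1')['data1'].agg(['mean', peak_peak])
print(result)
```

Group 1 holds 1.0, 3.0, and 7.0, giving a range of 6.0; group 2 holds 4.0 and 9.0, giving a mean of 6.5 and a range of 5.0.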