Using the groupby grouping function and the agg aggregation function in Python (pandas)
Import the numpy and pandas libraries
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
'B': [2, 7, 1, 3, 3, 2, 4, 8],
'C': [100, 87, 96, 130, 105, 87, 96, 155]})
df
   A  B    C
0  a  2  100
1  b  7   87
2  a  1   96
3  c  3  130
4  a  3  105
5  c  2   87
6  b  4   96
7  c  8  155
Group by a single column and compute the mean
df.groupby('A').mean()
          B           C
A
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000
Group by two columns and compute the mean
df.groupby(['A','B']).mean()
       C
A B
a 1   96
  2  100
  3  105
b 4   96
  7   87
c 2   87
  3  130
  8  155
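Grouping by two columns produces a result indexed by a MultiIndex of (A, B). As a minimal sketch of how one might work with such a result (the names by_ab, flat, and c_for_a3 are illustrative and not from the original), a single group can be looked up with a tuple, and reset_index() turns the index levels back into ordinary columns:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155]})

# Result is indexed by the (A, B) pairs that occur in the data
by_ab = df.groupby(['A', 'B']).mean()

# Look up one group by its (A, B) key
c_for_a3 = by_ab.loc[('a', 3), 'C']

# Move the index levels A and B back into regular columns
flat = by_ab.reset_index()
print(flat)
```

Every (A, B) pair in this sample is unique, so each group holds exactly one row and the mean of C is just that row's value.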
Select columns after grouping and compute on them
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
'B': [2, 7, 1, 3, 3, 2, 4, 8],
'C': [100, 87, 96, 130, 105, 87, 96, 155]})
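data = df.groupby('A')
# Standard deviation of column B within each group
data['B'].std()
# Select several columns with a list (double brackets), then take the group means
data[['B', 'C']].mean()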
          B           C
A
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000
After grouping by one column, apply a different aggregation method to each column
data.agg({'B': 'mean', 'C': 'sum'})
          B    C
A
a  2.000000  301
b  5.500000  183
c  4.333333  372
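The values in the dict passed to agg do not have to be function names; a callable that reduces a Series to a single value also works. The sketch below is an illustration rather than part of the original walkthrough: it keeps the mean for B and uses a lambda to compute the range (max minus min) of C in each group.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155]})

result = df.groupby('A').agg({
    'B': 'mean',                        # built-in aggregation by name
    'C': lambda s: s.max() - s.min(),   # custom reduction: range of C per group
})
print(result)
```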
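After grouping by one column, the apply function can be used to run the same aggregation method on every column
# Select the numeric columns first so the string column A is not passed to the function
df.groupby('A')[['B', 'C']].apply(lambda g: g.mean())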
          B           C
A
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000
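For a plain reduction like the mean, agg with a function name gives the same result as calling the method directly, and a list of names applies several reductions at once. This is a small sketch for comparison; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155]})

grouped = df.groupby('A')[['B', 'C']]

via_mean = grouped.mean()        # direct method call
via_agg = grouped.agg('mean')    # same reduction through agg

# A list of names yields one sub-column per (column, function) pair
several = grouped.agg(['mean', 'max'])
print(several)
```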
Bin a column's values into ranges and run grouped computations on the bins
np.random.seed(1)
df = pd.DataFrame({
'Age': np.random.randint(20, 70, 100),
'Sex': np.random.choice(['Male', 'Female'], 100),
'number_of_foo': np.random.randint(1, 20, 100)})
df
    Age     Sex  number_of_foo
0    57  Female             18
1    63  Female             15
2    32  Female             14
3    28  Female             12
4    29  Female              7
5    31  Female             14
6    25    Male             16
7    35    Male             10
8    20  Female              3
9    36  Female              8
10   21    Male              6
11   32    Male              5
12   27  Female              6
13   65    Male              9
14   26    Male             14
15   45    Male             18
16   40  Female             18
17   57  Female             16
18   38  Female             14
19   40  Female              9
20   31  Female             15
21   62    Male             14
22   48  Female             17
23   49    Male             11
24   34  Female             14
25   24    Male              4
26   43    Male              3
27   43    Male             15
28   61  Female             15
29   69  Female              1
..  ...     ...            ...
70   43    Male             18
71   27  Female              1
72   46  Female             11
73   45    Male             11
74   60    Male              3
75   42  Female              3
76   29  Female             18
77   23  Female              5
78   59    Male              1
79   43    Male             15
80   56    Male             18
81   47    Male              8
82   57    Male              4
83   39  Female             18
84   58  Female              7
85   28  Female             10
86   52    Male             11
87   54  Female             14
88   30  Female              7
89   43    Male              7
90   35    Male             17
91   67    Male             17
92   43  Female             19
93   45  Female             13
94   27    Male             10
95   48  Female              1
96   30  Female             17
97   66  Female             10
98   52    Male             17
99   44  Female             16

100 rows × 3 columns
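Split the Age column into groups; there are two ways to do this
Method 1
bins=4 (an integer requests that many equal-width bins, so this yields four groups)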
a = pd.cut(df['Age'], bins=4)
a
0 (56.75, 69.0]
1 (56.75, 69.0]
2 (19.951, 32.25]
3 (19.951, 32.25]
4 (19.951, 32.25]
5 (19.951, 32.25]
6 (19.951, 32.25]
7 (32.25, 44.5]
8 (19.951, 32.25]
9 (32.25, 44.5]
10 (19.951, 32.25]
11 (19.951, 32.25]
12 (19.951, 32.25]
13 (56.75, 69.0]
14 (19.951, 32.25]
15 (44.5, 56.75]
16 (32.25, 44.5]
17 (56.75, 69.0]
18 (32.25, 44.5]
19 (32.25, 44.5]
20 (19.951, 32.25]
21 (56.75, 69.0]
22 (44.5, 56.75]
23 (44.5, 56.75]
24 (32.25, 44.5]
25 (19.951, 32.25]
26 (32.25, 44.5]
27 (32.25, 44.5]
28 (56.75, 69.0]
29 (56.75, 69.0]
...
70 (32.25, 44.5]
71 (19.951, 32.25]
72 (44.5, 56.75]
73 (44.5, 56.75]
74 (56.75, 69.0]
75 (32.25, 44.5]
76 (19.951, 32.25]
77 (19.951, 32.25]
78 (56.75, 69.0]
79 (32.25, 44.5]
80 (44.5, 56.75]
81 (44.5, 56.75]
82 (56.75, 69.0]
83 (32.25, 44.5]
84 (56.75, 69.0]
85 (19.951, 32.25]
86 (44.5, 56.75]
87 (44.5, 56.75]
88 (19.951, 32.25]
89 (32.25, 44.5]
90 (32.25, 44.5]
91 (56.75, 69.0]
92 (32.25, 44.5]
93 (44.5, 56.75]
94 (19.951, 32.25]
95 (44.5, 56.75]
96 (19.951, 32.25]
97 (56.75, 69.0]
98 (44.5, 56.75]
99 (32.25, 44.5]
Name: Age, Length: 100, dtype: category
Categories (4, interval[float64]): [(19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69.0]]
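To see where edges such as 19.951 and 32.25 come from, pd.cut can return the bin edges it computed via retbins=True. The sketch below uses a small hypothetical Series that spans 20 to 69, like the Age column above, rather than the random data itself: the range is divided into four equal widths of (69 - 20) / 4 = 12.25, and the lowest edge is moved slightly below the minimum so that the minimum value falls inside the first right-closed interval.

```python
import numpy as np
import pandas as pd

# Hypothetical ages covering the same range (20 to 69) as the example data
ages = pd.Series([20, 32, 45, 57, 69], name='Age')

# retbins=True also returns the array of computed bin edges
binned, edges = pd.cut(ages, bins=4, retbins=True)

print(edges)             # lowest edge sits just below the minimum of 20
print(np.diff(edges))    # remaining bins are 12.25 wide
print(binned.cat.codes)  # integer position of each value's bin
```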
Method 2
bins=[19, 40, 65, np.inf] (explicit bin edges, which yield three groups)
b = pd.cut(df['Age'], bins=[19,40,65,np.inf])
b
0 (40.0, 65.0]
1 (40.0, 65.0]
2 (19.0, 40.0]
3 (19.0, 40.0]
4 (19.0, 40.0]
5 (19.0, 40.0]
6 (19.0, 40.0]
7 (19.0, 40.0]
8 (19.0, 40.0]
9 (19.0, 40.0]
10 (19.0, 40.0]
11 (19.0, 40.0]
12 (19.0, 40.0]
13 (40.0, 65.0]
14 (19.0, 40.0]
15 (40.0, 65.0]
16 (19.0, 40.0]
17 (40.0, 65.0]
18 (19.0, 40.0]
19 (19.0, 40.0]
20 (19.0, 40.0]
21 (40.0, 65.0]
22 (40.0, 65.0]
23 (40.0, 65.0]
24 (19.0, 40.0]
25 (19.0, 40.0]
26 (40.0, 65.0]
27 (40.0, 65.0]
28 (40.0, 65.0]
29 (65.0, inf]
...
70 (40.0, 65.0]
71 (19.0, 40.0]
72 (40.0, 65.0]
73 (40.0, 65.0]
74 (40.0, 65.0]
75 (40.0, 65.0]
76 (19.0, 40.0]
77 (19.0, 40.0]
78 (40.0, 65.0]
79 (40.0, 65.0]
80 (40.0, 65.0]
81 (40.0, 65.0]
82 (40.0, 65.0]
83 (19.0, 40.0]
84 (40.0, 65.0]
85 (19.0, 40.0]
86 (40.0, 65.0]
87 (40.0, 65.0]
88 (19.0, 40.0]
89 (40.0, 65.0]
90 (19.0, 40.0]
91 (65.0, inf]
92 (40.0, 65.0]
93 (40.0, 65.0]
94 (19.0, 40.0]
95 (40.0, 65.0]
96 (19.0, 40.0]
97 (65.0, inf]
98 (40.0, 65.0]
99 (40.0, 65.0]
Name: Age, Length: 100, dtype: category
Categories (3, interval[float64]): [(19.0, 40.0] < (40.0, 65.0] < (65.0, inf]]
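With explicit edges, two further pd.cut parameters are often useful: labels replaces the interval notation with readable names, and right=False makes the intervals closed on the left instead of the right. The following is a sketch on a few hypothetical boundary values; the label names 'young', 'middle', and 'senior' are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([20, 40, 41, 65, 66], name='Age')
edges = [19, 40, 65, np.inf]

# Default right=True: intervals are (19, 40], (40, 65], (65, inf]
labeled = pd.cut(ages, bins=edges, labels=['young', 'middle', 'senior'])

# right=False: intervals are [19, 40), [40, 65), [65, inf)
left_closed = pd.cut(ages, bins=edges, right=False)

print(labeled.tolist())
print(left_closed.tolist())
```

Note how 40 and 65 change bins between the two calls, since each sits exactly on an edge.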
# Group by the Age ranges; numeric_only=True skips the non-numeric Sex column
age_groups = pd.cut(df['Age'], bins=[19,40,65,np.inf])
df.groupby(age_groups).mean(numeric_only=True)
                    Age  number_of_foo
Age
(19.0, 40.0]  29.382979      10.404255
(40.0, 65.0]  51.130435      10.652174
(65.0, inf]   67.714286      10.428571
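A binned Series can be combined with column selection and agg like any other grouping key. As a sketch under the same setup (the exact random values may differ across platforms and NumPy versions, so no output is shown), the snippet below reports the mean of number_of_foo together with the number of rows in each age range. Passing observed=False keeps every bin in the result even if it happens to be empty.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({
    'Age': np.random.randint(20, 70, 100),
    'Sex': np.random.choice(['Male', 'Female'], 100),
    'number_of_foo': np.random.randint(1, 20, 100)})

age_groups = pd.cut(df['Age'], bins=[19, 40, 65, np.inf])

# Select one numeric column, then compute two statistics per age range
summary = df.groupby(age_groups, observed=False)['number_of_foo'].agg(['mean', 'count'])
print(summary)
```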
Build a cross-tabulation of the Age ranges against sex (Sex)
pd.crosstab(age_groups,df['Sex'])
Sex           Female  Male
Age
(19.0, 40.0]      25    22
(40.0, 65.0]      19    27
(65.0, inf]        3     4
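Raw counts are hard to compare when the age ranges differ in size. pd.crosstab accepts normalize='index' to turn each row into proportions that sum to 1. The sketch below uses a small hand-written dataset (not the random data above) so the numbers are easy to check by eye.

```python
import numpy as np
import pandas as pd

# Small hypothetical dataset: 8 people with known ages and sexes
people = pd.DataFrame({
    'Age': [22, 35, 41, 58, 66, 70, 30, 45],
    'Sex': ['Female', 'Male', 'Female', 'Male',
            'Female', 'Male', 'Female', 'Female']})

age_groups = pd.cut(people['Age'], bins=[19, 40, 65, np.inf])

counts = pd.crosstab(age_groups, people['Sex'])                      # raw counts
shares = pd.crosstab(age_groups, people['Sex'], normalize='index')   # row proportions

print(counts)
print(shares)
```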
Using the agg function
After grouping by a column (A) with groupby, apply several aggregation methods to another column. Here df is again the A/B/C DataFrame defined at the start.
# Passing a dict of {name: function} to a Series groupby raised a FutureWarning in older pandas
# ("using a dict on a Series for aggregation is deprecated") and was later removed;
# a list of function names gives the same mean and std columns.
df.groupby('A')['B'].agg(['mean', 'std'])
       mean      std
A
a  2.000000  1.00000
b  5.500000  2.12132
c  4.333333  3.21455
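When the output columns need custom names, named aggregation (available since pandas 0.25) is a supported alternative to the old dict-renaming form: each keyword becomes a column name and its value names the function to apply. A minimal sketch, with b_mean and b_std as made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155]})

# keyword = output column name, value = aggregation to apply to B
stats = df.groupby('A')['B'].agg(b_mean='mean', b_std='std')
print(stats)
```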
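After grouping by a column, apply several different aggregation methods to different columns
# Function names as strings behave the same as np.mean / np.std here
df.groupby('A').agg({'B': ['mean', 'sum'], 'C': ['count', 'std']})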
          B        C
       mean sum count        std
A
a  2.000000   6     3   4.509250
b  5.500000  11     2   6.363961
c  4.333333  13     3  34.394767
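Passing lists of functions produces two-level column labels such as ('B', 'mean'). If single-level names are easier to work with downstream, one common approach is to join the two levels with an underscore. This is a sketch of that idea, not something the original walkthrough does:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155]})

result = df.groupby('A').agg({'B': ['mean', 'sum'], 'C': ['count', 'std']})

# Each column label is a (column, function) tuple; join the parts into one string
result.columns = ['_'.join(col) for col in result.columns]
print(result)
```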
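The two approaches above return results indexed by the values of column A. What if the result should keep the original, ungrouped index instead? This is where the transform function comes in. transform(func, *args, **kwargs) applies func to every group and then places the results back on the index of the original DataFrame: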
df
   A  B    C
0  a  2  100
1  b  7   87
2  a  1   96
3  c  3  130
4  a  3  105
5  c  2   87
6  b  4   96
7  c  8  155
# Select several columns with a list (double brackets)
df.groupby('A')[['B', 'C']].transform('count')
   B  C
0  3  3
1  2  2
2  3  3
3  3  3
4  3  3
5  3  3
6  2  2
7  3  3
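As the result shows, when grouping by column A and counting columns B and C, the rows where A is 'a' have index [0, 2, 4], so the result holds 3 at indices [0, 2, 4]. In effect the per-group count is broadcast back to every row of its group. The same applies to column C.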
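Because transform returns values aligned with the original index, its result can be combined directly with the original columns. A typical use, sketched here with the made-up column names B_group_sum and B_share, is to express each row's B as a share of its group's total:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155]})

# Group total of B, repeated on every row of the group (same index as df)
df['B_group_sum'] = df.groupby('A')['B'].transform('sum')

# Row-wise division works because both Series share the original index
df['B_share'] = df['B'] / df['B_group_sum']
print(df)
```

With agg, the same calculation would need a merge back onto the original rows; transform avoids that step.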