Using the groupby grouping function and the agg aggregation function in Python (pandas)


Import the numpy and pandas libraries

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155]})
df
   A  B    C
0  a  2  100
1  b  7   87
2  a  1   96
3  c  3  130
4  a  3  105
5  c  2   87
6  b  4   96
7  c  8  155

Group by a single column and compute the mean

df.groupby('A').mean()  # group by column A, then take the mean of each group
          B           C
A
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000
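
To make it clearer what groupby does under the hood, here is a minimal sketch of the same calculation done by hand: split the rows by key, take the mean of each piece, and combine the pieces. The names manual and grouped are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155],
})

# Split: iterating a groupby yields (key, sub-DataFrame) pairs.
# Apply: take the column means of each sub-DataFrame.
# Combine: collect the per-group Series into one DataFrame indexed by key.
manual = pd.DataFrame(
    {key: group[['B', 'C']].mean() for key, group in df.groupby('A')}
).T
manual.index.name = 'A'

# The one-line equivalent used in the article.
grouped = df.groupby('A').mean()

print(manual)
print(grouped)
```

Both frames hold the same numbers; the one-liner is simply the idiomatic way to write it.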

Group by two columns and compute the mean

df.groupby(['A','B']).mean()
       C
A B
a 1   96
  2  100
  3  105
b 4   96
  7   87
c 2   87
  3  130
  8  155
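
Grouping by two columns produces a two-level index. If ordinary columns are easier to work with, one option is to pass as_index=False or call reset_index() afterwards. A small sketch (flat and same are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155],
})

# Keep the group keys as regular columns instead of a MultiIndex.
flat = df.groupby(['A', 'B'], as_index=False).mean()

# Alternative: build the MultiIndex result first, then move the keys back to columns.
same = df.groupby(['A', 'B']).mean().reset_index()

print(flat)
```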

Select columns after grouping, then compute

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155]})
data = df.groupby('A')  # group by column A
data['B'].std()  # standard deviation of B within each group

data[['B', 'C']].mean()  # mean of columns B and C; several columns must be passed as a list
          B           C
A
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000
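
The shape of the result depends on how the columns are selected. As a rough sketch: a single column name gives a Series per aggregation, while a list of names gives a DataFrame. The variable names std_b and mean_bc are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155],
})
data = df.groupby('A')

# One column name -> the aggregation returns a Series indexed by A.
std_b = data['B'].std()

# A list of column names -> the aggregation returns a DataFrame indexed by A.
mean_bc = data[['B', 'C']].mean()

print(type(std_b).__name__)    # Series
print(type(mean_bc).__name__)  # DataFrame
print(std_b)
```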

After grouping by one column, apply a different aggregation to each column

data.agg({'B': 'mean', 'C': 'sum'})  # mean of column B, sum of column C
          B    C
A
a  2.000000  301
b  5.500000  183
c  4.333333  372
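
As an alternative to the dict form, pandas 0.25 and later also accept named aggregation, where each keyword becomes an output column name. A minimal sketch; the output names B_mean and C_sum are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155],
})

# keyword = (source column, aggregation); the keyword is the new column name.
named = df.groupby('A').agg(
    B_mean=('B', 'mean'),
    C_sum=('C', 'sum'),
)

print(named)
```

This keeps the result columns flat and self-describing, which can help when the same source column is aggregated more than once.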

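After grouping by one column, the apply function can be used to run the same aggregation on every other column

# Select B and C explicitly so the string column A is not passed to the mean
df.groupby('A')[['B', 'C']].apply(lambda g: g.mean())
          B           C
A
a  2.000000  100.333333
b  5.500000   91.500000
c  4.333333  124.000000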
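
For a plain mean, apply is not strictly needed; where it becomes useful is with functions that have no built-in name. A minimal sketch, assuming we want the spread (max minus min) of each column per group; the name spread is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155],
})

# Each group g is a DataFrame with columns B and C;
# the lambda returns one Series per group, which apply stacks into a DataFrame.
spread = df.groupby('A')[['B', 'C']].apply(lambda g: g.max() - g.min())

print(spread)
```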

Bin a column's values into ranges and group by those ranges

np.random.seed(1)
df = pd.DataFrame({
    'Age': np.random.randint(20, 70, 100),
    'Sex': np.random.choice(['Male', 'Female'], 100),
    'number_of_foo': np.random.randint(1, 20, 100)})

df
Age Sex number_of_foo
0 57 Female 18
1 63 Female 15
2 32 Female 14
3 28 Female 12
4 29 Female 7
5 31 Female 14
6 25 Male 16
7 35 Male 10
8 20 Female 3
9 36 Female 8
10 21 Male 6
11 32 Male 5
12 27 Female 6
13 65 Male 9
14 26 Male 14
15 45 Male 18
16 40 Female 18
17 57 Female 16
18 38 Female 14
19 40 Female 9
20 31 Female 15
21 62 Male 14
22 48 Female 17
23 49 Male 11
24 34 Female 14
25 24 Male 4
26 43 Male 3
27 43 Male 15
28 61 Female 15
29 69 Female 1
... ... ... ...
70 43 Male 18
71 27 Female 1
72 46 Female 11
73 45 Male 11
74 60 Male 3
75 42 Female 3
76 29 Female 18
77 23 Female 5
78 59 Male 1
79 43 Male 15
80 56 Male 18
81 47 Male 8
82 57 Male 4
83 39 Female 18
84 58 Female 7
85 28 Female 10
86 52 Male 11
87 54 Female 14
88 30 Female 7
89 43 Male 7
90 35 Male 17
91 67 Male 17
92 43 Female 19
93 45 Female 13
94 27 Male 10
95 48 Female 1
96 30 Female 17
97 66 Female 10
98 52 Male 17
99 44 Female 16

100 rows × 3 columns

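Split the Age column into groups. There are two ways to do this: pass an integer number of bins (four equal-width intervals here), or pass explicit bin edges (three intervals in the second example).

Method 1: an integer number of bins

# Method 1: let pandas compute equal-width bins
bins = 4
a = pd.cut(df['Age'], bins=bins)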
a
0       (56.75, 69.0]
1       (56.75, 69.0]
2     (19.951, 32.25]
3     (19.951, 32.25]
4     (19.951, 32.25]
5     (19.951, 32.25]
6     (19.951, 32.25]
7       (32.25, 44.5]
8     (19.951, 32.25]
9       (32.25, 44.5]
10    (19.951, 32.25]
11    (19.951, 32.25]
12    (19.951, 32.25]
13      (56.75, 69.0]
14    (19.951, 32.25]
15      (44.5, 56.75]
16      (32.25, 44.5]
17      (56.75, 69.0]
18      (32.25, 44.5]
19      (32.25, 44.5]
20    (19.951, 32.25]
21      (56.75, 69.0]
22      (44.5, 56.75]
23      (44.5, 56.75]
24      (32.25, 44.5]
25    (19.951, 32.25]
26      (32.25, 44.5]
27      (32.25, 44.5]
28      (56.75, 69.0]
29      (56.75, 69.0]
           ...       
70      (32.25, 44.5]
71    (19.951, 32.25]
72      (44.5, 56.75]
73      (44.5, 56.75]
74      (56.75, 69.0]
75      (32.25, 44.5]
76    (19.951, 32.25]
77    (19.951, 32.25]
78      (56.75, 69.0]
79      (32.25, 44.5]
80      (44.5, 56.75]
81      (44.5, 56.75]
82      (56.75, 69.0]
83      (32.25, 44.5]
84      (56.75, 69.0]
85    (19.951, 32.25]
86      (44.5, 56.75]
87      (44.5, 56.75]
88    (19.951, 32.25]
89      (32.25, 44.5]
90      (32.25, 44.5]
91      (56.75, 69.0]
92      (32.25, 44.5]
93      (44.5, 56.75]
94    (19.951, 32.25]
95      (44.5, 56.75]
96    (19.951, 32.25]
97      (56.75, 69.0]
98      (44.5, 56.75]
99      (32.25, 44.5]
Name: Age, Length: 100, dtype: category
Categories (4, interval[float64]): [(19.951, 32.25] < (32.25, 44.5] < (44.5, 56.75] < (56.75, 69.0]]
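
The odd-looking lower edge 19.951 comes from how pd.cut builds equal-width bins: it spaces the edges evenly between the minimum and maximum, then nudges the lowest edge down by 0.1% of the range so the minimum value falls inside the first right-closed interval. A small sketch using retbins=True to inspect the edges; the ages Series is made-up data with the same minimum (20) and maximum (69):

```python
import pandas as pd

# Hypothetical sample with min 20 and max 69, matching the range in the article.
ages = pd.Series([20, 25, 33, 45, 57, 69])

# retbins=True additionally returns the computed bin edges.
binned, edges = pd.cut(ages, bins=4, retbins=True)

# Width is (69 - 20) / 4 = 12.25; the first edge is 20 - 0.001 * 49 = 19.951.
print(edges)  # approximately [19.951, 32.25, 44.5, 56.75, 69.0]
print(binned.value_counts().sort_index())
```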

Method 2: explicit bin edges

bins = [19, 40, 65, np.inf]
b = pd.cut(df['Age'], bins=bins)
b
0     (40.0, 65.0]
1     (40.0, 65.0]
2     (19.0, 40.0]
3     (19.0, 40.0]
4     (19.0, 40.0]
5     (19.0, 40.0]
6     (19.0, 40.0]
7     (19.0, 40.0]
8     (19.0, 40.0]
9     (19.0, 40.0]
10    (19.0, 40.0]
11    (19.0, 40.0]
12    (19.0, 40.0]
13    (40.0, 65.0]
14    (19.0, 40.0]
15    (40.0, 65.0]
16    (19.0, 40.0]
17    (40.0, 65.0]
18    (19.0, 40.0]
19    (19.0, 40.0]
20    (19.0, 40.0]
21    (40.0, 65.0]
22    (40.0, 65.0]
23    (40.0, 65.0]
24    (19.0, 40.0]
25    (19.0, 40.0]
26    (40.0, 65.0]
27    (40.0, 65.0]
28    (40.0, 65.0]
29     (65.0, inf]
          ...     
70    (40.0, 65.0]
71    (19.0, 40.0]
72    (40.0, 65.0]
73    (40.0, 65.0]
74    (40.0, 65.0]
75    (40.0, 65.0]
76    (19.0, 40.0]
77    (19.0, 40.0]
78    (40.0, 65.0]
79    (40.0, 65.0]
80    (40.0, 65.0]
81    (40.0, 65.0]
82    (40.0, 65.0]
83    (19.0, 40.0]
84    (40.0, 65.0]
85    (19.0, 40.0]
86    (40.0, 65.0]
87    (40.0, 65.0]
88    (19.0, 40.0]
89    (40.0, 65.0]
90    (19.0, 40.0]
91     (65.0, inf]
92    (40.0, 65.0]
93    (40.0, 65.0]
94    (19.0, 40.0]
95    (40.0, 65.0]
96    (19.0, 40.0]
97     (65.0, inf]
98    (40.0, 65.0]
99    (40.0, 65.0]
Name: Age, Length: 100, dtype: category
Categories (3, interval[float64]): [(19.0, 40.0] < (40.0, 65.0] < (65.0, inf]]
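# Group by the age ranges and take the mean of the numeric columns
age_groups = pd.cut(df['Age'], bins=[19, 40, 65, np.inf])
df.groupby(age_groups).mean(numeric_only=True)  # numeric_only skips the string column Sex
                    Age  number_of_foo
Age
(19.0, 40.0]  29.382979      10.404255
(40.0, 65.0]  51.130435      10.652174
(65.0, inf]   67.714286      10.428571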
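
If interval notation is awkward to read in a report, pd.cut also takes a labels argument. A minimal sketch on made-up ages; the label strings are purely illustrative. It also shows that the default intervals are closed on the right, so 40 lands in the first bin and 65 in the second:

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([20, 40, 41, 65, 66])

# One label per interval: (19, 40], (40, 65], (65, inf]
labeled = pd.cut(ages, bins=[19, 40, 65, np.inf],
                 labels=['19-40', '41-65', '65+'])

print(labeled.tolist())  # ['19-40', '19-40', '41-65', '41-65', '65+']
```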

Build a cross-tabulation of the Age ranges against Sex

pd.crosstab(age_groups, df['Sex'])  # crosstab counts the rows in each (age range, sex) combination
Sex           Female  Male
Age
(19.0, 40.0]      25    22
(40.0, 65.0]      19    27
(65.0, inf]        3     4
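
A cross-tabulation is essentially a two-key groupby followed by a count and a pivot. The sketch below rebuilds the same kind of table with groupby, size and unstack, to show the connection; via_groupby is an illustrative name and the data is regenerated with the same seed as in the article.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
df = pd.DataFrame({
    'Age': np.random.randint(20, 70, 100),
    'Sex': np.random.choice(['Male', 'Female'], 100),
    'number_of_foo': np.random.randint(1, 20, 100),
})
age_groups = pd.cut(df['Age'], bins=[19, 40, 65, np.inf])

ct = pd.crosstab(age_groups, df['Sex'])

# Group by both keys, count rows per combination, then pivot Sex into columns.
via_groupby = (
    df.groupby([age_groups, 'Sex'], observed=False)
      .size()
      .unstack(fill_value=0)
)

print(ct)
print(via_groupby)
```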

Using the agg function

After grouping by one column (A) with groupby, apply several different aggregations to another column. (Here df is again the A/B/C DataFrame from the start of the article.)

# Passing a dict to a Series aggregation was deprecated and later removed from pandas;
# a list of function names gives the same mean/std columns.
df.groupby('A')['B'].agg(['mean', 'std'])
       mean      std
A
a  2.000000  1.00000
b  5.500000  2.12132
c  4.333333  3.21455

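After grouping by one column, apply different aggregations to different columns

# Each column gets its own list of aggregations; the result has two-level column labels.
df.groupby('A').agg({'B': ['mean', 'sum'], 'C': ['count', 'std']})
          B         C
       mean sum count        std
A
a  2.000000   6     3   4.509250
b  5.500000  11     2   6.363961
c  4.333333  13     3  34.394767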
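
The two-level column labels in this result can be inconvenient for later steps. One common way to flatten them, sketched here, is to join the two levels with an underscore; the joined names such as B_mean are just the natural outcome of that convention.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155],
})

result = df.groupby('A').agg({'B': ['mean', 'sum'], 'C': ['count', 'std']})

# result.columns is a MultiIndex of (column, aggregation) tuples; join each tuple.
result.columns = ['_'.join(col) for col in result.columns]

print(result)
```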

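Both approaches above return a result indexed by the values of column A. What if you want the result on the original, ungrouped index instead? That is what transform is for. transform(func, *args, **kwargs) applies func to each group and then places the results back on the index of the original data: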

df
   A  B    C
0  a  2  100
1  b  7   87
2  a  1   96
3  c  3  130
4  a  3  105
5  c  2   87
6  b  4   96
7  c  8  155
df.groupby('A')[['B', 'C']].transform('count')  # note: count ignores NaN values
   B  C
0  3  3
1  2  2
2  3  3
3  3  3
4  3  3
5  3  3
6  2  2
7  3  3

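The output shows what happens: when grouping by column A and counting columns B and C, the rows where A is 'a' sit at index [0, 2, 4], so positions [0, 2, 4] of the result all hold 3. The group-level count is effectively broadcast back to every row of its group. The same holds for column C.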
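
Because transform keeps the original index, its result can be combined directly with the source columns. A minimal sketch, assuming we want each B value expressed relative to its group mean; the column name B_centered is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
    'B': [2, 7, 1, 3, 3, 2, 4, 8],
    'C': [100, 87, 96, 130, 105, 87, 96, 155],
})

# One value per original row: the mean of B within that row's group.
group_mean = df.groupby('A')['B'].transform('mean')

# Same index as df, so plain element-wise arithmetic lines up row by row.
df['B_centered'] = df['B'] - group_mean

print(df)
```

After centering, the values within each group sum to zero (up to floating-point error), which is a quick sanity check.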
