When computing a column's maximum, median, count and similar statistics in SQL, queries of the following form are common:
select
colE,
max(colA) as A1, min(colA) as A2, median(colA) as A3, sum(colA) as A4,
count(distinct colC) as C1, count(colC) as C2,
count(case when colC = 0 then 1 end) as C3
from tableT
group by colE;
In pandas, the aggregation function agg() can be used to implement the query above.
agg is an alias of aggregate.
DataFrame.agg(func=None, axis=0, *args, **kwargs)
Returns: scalar, Series or DataFrame
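As a rough illustration of the three return types, the sketch below calls agg on a Series with one function, on a DataFrame with one function, and on a DataFrame with a list of functions. The small frame and the variable names are made up for this example, and string aliases such as 'max' are assumed in place of callables.

```python
import pandas as pd

# Hypothetical sample frame, used only for this illustration
demo = pd.DataFrame({"VAL1": [1, -3, 0], "VAL2": [2, 12, 6]})

# Series + a single function -> scalar
as_scalar = demo["VAL1"].agg("max")

# DataFrame + a single function -> Series (one value per column)
as_series = demo.agg("max")

# DataFrame + a list of functions -> DataFrame (one row per function)
as_frame = demo.agg(["max", "min"])

print(as_scalar)
print(as_series)
print(as_frame)
```

Which of the three shapes comes back depends on both the object agg is called on and the shape of the func argument.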
The data is as follows:
import numpy as np
import pandas as pd
data = [['AA',1,2],['BB',-3,12],['CC',0,6],['DD',7,9],['BB',3,5],['CC',4,9]]
df = pd.DataFrame(data, columns=['NAME','VAL1','VAL2'])
## Result
  NAME  VAL1  VAL2
0   AA     1     2
1   BB    -3    12
2   CC     0     6
3   DD     7     9
4   BB     3     5
5   CC     4     9
1. The agg({'column_name': func}) form
df.agg({'VAL1':[np.max, np.min, pd.Series.nunique], 'VAL2':[np.max, np.min, np.mean]})
## Result ##
         VAL1       VAL2
amax      7.0  12.000000
amin     -3.0   2.000000
mean      NaN   7.166667
nunique   6.0        NaN
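The NaN cells in the result come from the per-column function lists: a function that was not requested for a column has no value there. The sketch below repeats the call with string aliases ('max', 'min', 'nunique', 'mean') instead of the NumPy callables. That substitution is an assumption of this example, not the form used above. With strings the row labels are the strings themselves, whereas the amax/amin labels may appear as max/min depending on the installed pandas and NumPy versions.

```python
import pandas as pd

data = [['AA', 1, 2], ['BB', -3, 12], ['CC', 0, 6],
        ['DD', 7, 9], ['BB', 3, 5], ['CC', 4, 9]]
df = pd.DataFrame(data, columns=['NAME', 'VAL1', 'VAL2'])

# Each column gets its own list of aggregations
res = df.agg({'VAL1': ['max', 'min', 'nunique'],
              'VAL2': ['max', 'min', 'mean']})

# 'mean' was not requested for VAL1 and 'nunique' not for VAL2,
# so those two cells are NaN and both columns become float
print(res)
```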
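2. The df['column_name'].agg({func}) form
# max, min, mean, sum, median, number of distinct values
# (the functions are passed as a set, so the column order in the output is arbitrary)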
df.groupby('NAME')[['VAL1','VAL2']].agg({np.max, np.min, np.mean, np.sum, np.median, pd.Series.nunique})
## Result ##
       VAL1                              VAL2
     median nunique amax mean amin sum median nunique amax mean amin sum
NAME
AA        1       1    1    1    1   1    2.0       1    2  2.0    2   2
BB        0       2    3    0   -3   0    8.5       2   12  8.5    5  17
CC        2       2    4    2    0   4    7.5       2    9  7.5    6  15
DD        7       1    7    7    7   7    9.0       1    9  9.0    9   9
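To connect this back to the SQL query at the top, here is a minimal sketch using named aggregation, which is assumed to be available (pandas 0.25 or later). The column mapping is an assumption made for this example: colE is taken as NAME, colA as VAL2 and colC as VAL1, chosen so that the colC = 0 condition matches one row of the sample data.

```python
import pandas as pd

data = [['AA', 1, 2], ['BB', -3, 12], ['CC', 0, 6],
        ['DD', 7, 9], ['BB', 3, 5], ['CC', 4, 9]]
df = pd.DataFrame(data, columns=['NAME', 'VAL1', 'VAL2'])

# Each keyword is an output column: name=(source column, aggregation)
out = df.groupby('NAME').agg(
    A1=('VAL2', 'max'),                     # max(colA)
    A2=('VAL2', 'min'),                     # min(colA)
    A3=('VAL2', 'median'),                  # median(colA)
    A4=('VAL2', 'sum'),                     # sum(colA)
    C1=('VAL1', 'nunique'),                 # count(distinct colC)
    C2=('VAL1', 'count'),                   # count(colC), missing values excluded
    C3=('VAL1', lambda s: (s == 0).sum()),  # count(case when colC = 0 then 1 end)
)
print(out)
```

Named aggregation gives flat output column names that match the SQL aliases, instead of the two-level column header produced when a list or set of functions is passed.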
3. Custom functions in lambda form
# minimum
v2 = lambda x: x.min()
v2.__name__ = 'va'
df.agg({'VAL1':[np.max, np.min, v2], 'VAL2':[np.mean, v2, np.max, np.min]})
## Result ##
      VAL1       VAL2
amax   7.0  12.000000
amin  -3.0   2.000000
mean   NaN   7.166667
va    -3.0   2.000000
## Note: if the function name v2.__name__ is not set, the result is displayed as
          VAL1       VAL2
<lambda>  -3.0   2.000000
amax       7.0  12.000000
amin      -3.0   2.000000
mean       NaN   7.166667
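One possible alternative to patching __name__ on a lambda is to define a regular named function, since the row label is taken from the function's name. The helper va below is hypothetical and simply mirrors the lambda above. String aliases are again assumed for the built-in aggregations.

```python
import pandas as pd

data = [['AA', 1, 2], ['BB', -3, 12], ['CC', 0, 6],
        ['DD', 7, 9], ['BB', 3, 5], ['CC', 4, 9]]
df = pd.DataFrame(data, columns=['NAME', 'VAL1', 'VAL2'])

def va(x):
    """Hypothetical helper: return the minimum, like the lambda version."""
    return x.min()

# The custom function appears in the result under its own name, 'va'
res = df.agg({'VAL1': ['max', va], 'VAL2': ['mean', va]})
print(res)
```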