Create the data:
#!/usr/bin/python
import pandas as pd
import numpy as np
data = {'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
'B': [2, 8, 1, 4, 3, 2, 5, 9],
'C': [102, 98, 107, 104, 115, 87, 92, 123]
}
df = pd.DataFrame(data=data)
print("df contents:\n", df, '\n')
out:
df contents:
A B C
0 a 2 102
1 b 8 98
2 a 1 107
3 c 4 104
4 a 3 115
5 c 2 87
6 b 5 92
7 c 9 123
Group by column A and take the mean of the remaining columns:
print(df.groupby('A').mean())
out:
B C
A
a 2.0 108.000000
b 6.5 95.000000
c 5.0 104.666667
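Group by both column A and column B; mean() returns floats, and the group keys form a two-level index:
print(df.groupby(['A', 'B']).mean())
out:
C
A B
a 1 107.0
2 102.0
3 115.0
b 5 92.0
8 98.0
c 2 87.0
4 104.0
9 123.0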
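Grouping by two columns puts the group keys into a MultiIndex. A minimal sketch, assuming you would rather keep the keys as ordinary columns: pass as_index=False to groupby. The variable names nested and flat are illustrative only.

```python
import pandas as pd

data = {'A': ['a', 'b', 'a', 'c', 'a', 'c', 'b', 'c'],
        'B': [2, 8, 1, 4, 3, 2, 5, 9],
        'C': [102, 98, 107, 104, 115, 87, 92, 123]}
df = pd.DataFrame(data)

# Default: the group keys A and B become a two-level MultiIndex.
nested = df.groupby(['A', 'B']).mean()

# as_index=False: A and B stay as regular columns with a default integer index.
flat = df.groupby(['A', 'B'], as_index=False).mean()

print(nested.index.nlevels)  # number of index levels in the default result
print(list(flat.columns))    # column names of the flat result
```

The flat form is usually easier to merge with other tables or write to CSV, while the MultiIndex form is convenient for lookups such as nested.loc[('a', 1)].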
Select columns after grouping, then compute:
df = pd.DataFrame([[1, 1, 2], [1, 2, 3], [2, 3, 4]], columns=["A", "B", "C"])
print("df contents:\n", df, '\n')
out:
df contents:
A B C
0 1 1 2
1 1 2 3
2 2 3 4
Group by column A and show the mean of column B:
print(df.groupby(['A'])['B'].mean())
out:
A
1 1.5
2 3.0
Name: B, dtype: float64
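Group by column A and select several columns (pass a list of column names; the bare tuple form ['B','C'] is rejected by recent pandas versions):
print(df.groupby(['A'])[['B', 'C']].mean())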
out:
B C
A
1 1.5 2.5
2 3.0 4.0
Apply a different aggregation to each column:
g = df.groupby('A')
print(g.agg({'B':'mean', 'C':'sum'}))
out:
B C
A
1 1.5 5
2 3.0 4
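When each column gets its own aggregation, it can help to name the output columns explicitly. The following is a minimal sketch using pandas named aggregation (available since pandas 0.25); the output names b_mean and c_sum are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame([[1, 1, 2], [1, 2, 3], [2, 3, 4]], columns=["A", "B", "C"])

# Named aggregation: output_name=(source column, aggregation function).
# b_mean and c_sum are hypothetical names chosen for this sketch.
result = df.groupby('A').agg(b_mean=('B', 'mean'), c_sum=('C', 'sum'))
print(result)
```

The values match the dict-based call above; only the column labels differ.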
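Aggregation methods:
Two counting aggregations are size() and count().
The difference between them: size() counts every row in a group, including rows whose value is NaN, whereas count() counts only the non-NaN values.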
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name":["Alice", "Bob", "Mallory", "Mallory", "Bob"
, "Mallory"],
"City":["Seattle", "Seattle", "Portland", "Seattle",
"Seattle", "Portland"],
"Val":[4,3,3,np.nan,np.nan,4]})
print(df, '\n')
print("df count output:\n", df.groupby(["Name", "City"], as_index=False).count(), '\n')
print("df size output:\n", df.groupby(["Name", "City"])['Val'].size().reset_index(name='Size'), '\n')
out:
Name City Val
0 Alice Seattle 4.0
1 Bob Seattle 3.0
2 Mallory Portland 3.0
3 Mallory Seattle NaN
4 Bob Seattle NaN
5 Mallory Portland 4.0
df count output:
Name City Val
0 Alice Seattle 1
1 Bob Seattle 1
2 Mallory Portland 2
3 Mallory Seattle 0
df size output:
Name City Size
0 Alice Seattle 1
1 Bob Seattle 2
2 Mallory Portland 2
3 Mallory Seattle 1
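Because size() includes NaN rows and count() does not, subtracting one from the other gives the number of missing values per group. This is a small sketch on a cut-down, hypothetical version of the table above (only Name and Val), not the author's own code.

```python
import numpy as np
import pandas as pd

# Cut-down illustrative data: Bob has one NaN, Mallory has only a NaN.
df = pd.DataFrame({"Name": ["Bob", "Bob", "Mallory"],
                   "Val": [3, np.nan, np.nan]})

grouped = df.groupby("Name")["Val"]
sizes = grouped.size()    # rows per group, NaN included
counts = grouped.count()  # non-NaN values per group
missing = sizes - counts  # NaN values per group

print(missing)
```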
Group-wise computation with agg():
df = pd.DataFrame({'A': list('XYZXYZXYZX'), 'B': [1, 2, 1, 3, 1, 2, 3,
3, 1, 2],'C': [12, 14, 11, 12, 13, 14, 16, 12, 10,19]})
print(df, '\n')
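# Passing a dict of {name: function} to agg() on a single column is rejected by recent pandas versions; a list of (name, function) pairs names the output columns instead.
print("df output:\n", df.groupby('A')['B'].agg([('mean', 'mean'), ('standard deviation', 'std')]), '\n')
out:
A B C
0 X 1 12
1 Y 2 14
2 Z 1 11
3 X 3 12
4 Y 1 13
5 Z 2 14
6 X 3 16
7 Y 3 12
8 Z 1 10
9 X 2 19
df output:
mean standard deviation
A
X 2.250000 0.957427
Y 2.000000 1.000000
Z 1.333333 0.577350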
Apply several different statistics to different columns:
print("df output:\n", df.groupby('A').agg({'B': ['mean', 'sum'], 'C': ['count', 'std']}), '\n')
out:
df output:
B C
mean sum count std
A
X 2.250000 9 4 3.403430
Y 2.000000 6 3 1.000000
Z 1.333333 4 3 2.081666
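Passing lists of functions produces two-level column labels such as ('B', 'mean'). A minimal sketch, assuming you want single-level names for later processing: join each label pair with an underscore. The underscore separator is an arbitrary choice for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': list('XYZXYZXYZX'),
                   'B': [1, 2, 1, 3, 1, 2, 3, 3, 1, 2],
                   'C': [12, 14, 11, 12, 13, 14, 16, 12, 10, 19]})

stats = df.groupby('A').agg({'B': ['mean', 'sum'], 'C': ['count', 'std']})

# Each column label is a (column, function) tuple; join the two parts with "_".
stats.columns = ['_'.join(col) for col in stats.columns]
print(stats.columns.tolist())
```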
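Group-wise computation with apply() (select the numeric columns first so that the string column A is not passed to the function):
print("df output:\n", df.groupby('A')[['B', 'C']].apply(lambda g: g.mean()), '\n')
out:
df output: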
B C
A
X 2.250000 14.750000
Y 2.000000 13.000000
Z 1.333333 11.666667
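apply() is most useful when the computation is not a built-in aggregation. The sketch below assumes a hypothetical helper, value_range, that returns the spread (max minus min) of column B within each group.

```python
import pandas as pd

df = pd.DataFrame({'A': list('XYZXYZXYZX'),
                   'B': [1, 2, 1, 3, 1, 2, 3, 3, 1, 2],
                   'C': [12, 14, 11, 12, 13, 14, 16, 12, 10, 19]})

def value_range(s):
    """Hypothetical helper: largest value minus smallest value of a Series."""
    return s.max() - s.min()

# apply() calls value_range once per group, passing that group's B values.
spread = df.groupby('A')['B'].apply(value_range)
print(spread)
```

For simple reductions like this one, agg(value_range) gives the same result; apply() is the more general tool because the function may also return a Series or DataFrame per group.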