import pandas as pd
df = pd.DataFrame({'Country':['China','China', 'India', 'India', 'America', 'Japan', 'China', 'India'],
'Income':[10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
'Age':[5000, 4321, 1234, 4010, 250, 250, 4500, 4321]})
df
|
Age |
Country |
Income |
0 |
5000 |
China |
10000 |
1 |
4321 |
China |
10000 |
2 |
1234 |
India |
5000 |
3 |
4010 |
India |
5002 |
4 |
250 |
America |
40000 |
5 |
250 |
Japan |
50000 |
6 |
4500 |
China |
8000 |
7 |
4321 |
India |
5000 |
主要就是理解下面这句话!!!后面的是一些细节!!
一个基础的语法!!!
a=df.groupby('Country', as_index = False)['Income'].agg({ 'Income_sum':'sum'})
a
|
Country |
Income_sum |
0 |
America |
40000 |
1 |
China |
28000 |
2 |
India |
15002 |
3 |
Japan |
50000 |
上面代码的意思:统计每一种Conutry下Income的总和!!!
具体起到的作用:
①groupby(‘Country’, as_index = False) :按每一种Country进行统计!!!!!
②[‘Income’]:统计的是每一种Country下Income的相关特征!!!
③agg函数中 ‘Income_sum’:‘sum’,第一个字段’Income_sum’,其实你也可以写’sum’,但是如果写’Income_sum’,pandas会将
“每一个country下的Income总和”这个特征的特证名添为’Income_sum’
可以看到的是,这样造出来的 Income_sum特征,需要拼接回原来的df上去
也就是
data_merged = pd.merge(df, a, on=['Country'], how='left')
data_merged
|
Age |
Country |
Income |
Income_sum |
0 |
5000 |
China |
10000 |
28000 |
1 |
4321 |
China |
10000 |
28000 |
2 |
1234 |
India |
5000 |
15002 |
3 |
4010 |
India |
5002 |
15002 |
4 |
250 |
America |
40000 |
40000 |
5 |
250 |
Japan |
50000 |
50000 |
6 |
4500 |
China |
8000 |
28000 |
7 |
4321 |
India |
5000 |
15002 |
比赛中经常用到的写法
df.groupby('Country').Income.transform('sum')
0 28000
1 28000
2 15002
3 15002
4 40000
5 50000
6 28000
7 15002
Name: Income, dtype: int64
可以看到,上面的数据,如第0行,和China的Income总和是一样的,而
df的第0行,实际上也是China!!,也就是说,他已经按Country字段给merge好了!!!
-------------------------------------------------分割线,下面是原理和细节
a=df.groupby('Country')
a.max()
|
Age |
Income |
Country |
|
|
America |
250 |
40000 |
China |
5000 |
10000 |
India |
4321 |
5002 |
Japan |
250 |
50000 |
a.agg(['min', 'mean', 'max'])
|
Age |
Income |
|
min |
mean |
max |
min |
mean |
max |
Country |
|
|
|
|
|
|
America |
250 |
250.000000 |
250 |
40000 |
40000.000000 |
40000 |
China |
4321 |
4607.000000 |
5000 |
8000 |
9333.333333 |
10000 |
India |
1234 |
3188.333333 |
4321 |
5000 |
5000.666667 |
5002 |
Japan |
250 |
250.000000 |
250 |
50000 |
50000.000000 |
50000 |
下面这个就是单独的取了一个字段的聚合结果
下面的结果表示,a[‘Age’]是单独的一个字段的group,对单独的一个字段的group我们可以对其聚合进行改名
比如下面 num_agg = {‘p’:[‘min’, ‘mean’, ‘max’]} 其实那个’p’按常规来说应该写成Age,但是此处这么写,就相当于将表中的Age字段直接更名为’p’了
num_agg = {'p':['min', 'mean', 'max']}
a['Age'].agg(num_agg)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
|
p |
|
min |
mean |
max |
Country |
|
|
|
America |
250 |
250.000000 |
250 |
China |
4321 |
4607.000000 |
5000 |
India |
1234 |
3188.333333 |
4321 |
Japan |
250 |
250.000000 |
250 |
要注意 a是一个包含两个字段的(Income和Age)的group,
agg函数不支持对含两个以上字段的聚合进行名字更改(如下图)
num_agg = {'Age_1':['min', 'mean', 'max']}
a.agg(num_agg)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
in ()
1 num_agg = {'Age_1':['min', 'mean', 'max']}
----> 2 a.agg(num_agg)###### a包含了两个以上的字段了!
~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
1313 @Appender(_shared_docs['aggregate'])
1314 def aggregate(self, arg, *args, **kwargs):
-> 1315 return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
1316
1317 agg = aggregate
~/.local/lib/python3.5/site-packages/pandas/core/groupby/generic.py in aggregate(self, arg, *args, **kwargs)
184
185 _level = kwargs.pop('_level', None)
--> 186 result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
187 if how is None:
188 return result
~/.local/lib/python3.5/site-packages/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
408 k not in obj.columns):
409 raise KeyError(
--> 410 "Column '{col}' does not exist!".format(col=k))
411
412 arg = new_arg
KeyError: "Column 'Age_1' does not exist!"
num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
a.agg(num_agg,as_index=False)
|
Income |
Age |
|
min |
max |
min |
mean |
max |
Country |
|
|
|
|
|
America |
40000 |
40000 |
250 |
250.000000 |
250 |
China |
8000 |
10000 |
4321 |
4607.000000 |
5000 |
India |
5000 |
5002 |
1234 |
3188.333333 |
4321 |
Japan |
50000 |
50000 |
250 |
250.000000 |
250 |
num_agg = {'Age':['min', 'mean', 'max'], 'Income':['min', 'max']}
a.agg(num_agg,as_index=True)
|
Income |
Age |
|
min |
max |
min |
mean |
max |
Country |
|
|
|
|
|
America |
40000 |
40000 |
250 |
250.000000 |
250 |
China |
8000 |
10000 |
4321 |
4607.000000 |
5000 |
India |
5000 |
5002 |
1234 |
3188.333333 |
4321 |
Japan |
50000 |
50000 |
250 |
250.000000 |
250 |
num_agg = {'Age':['min', 'mean', 'max']}
a.agg(num_agg,as_index=False)
|
Age |
|
min |
mean |
max |
Country |
|
|
|
America |
250 |
250.000000 |
250 |
China |
4321 |
4607.000000 |
5000 |
India |
1234 |
3188.333333 |
4321 |
Japan |
250 |
250.000000 |
250 |
num_agg = {'Age':['min', 'mean', 'max']}
a['Age'].agg(num_agg,as_index=True)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/ipykernel_launcher.py:2: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
|
Age |
|
min |
mean |
max |
Country |
|
|
|
America |
250 |
250.000000 |
250 |
China |
4321 |
4607.000000 |
5000 |
India |
1234 |
3188.333333 |
4321 |
Japan |
250 |
250.000000 |
250 |
num_agg = {'min', 'mean', 'max'}
a['Age'].agg(num_agg,as_index=False)
|
min |
max |
mean |
Country |
|
|
|
America |
250 |
250 |
250.000000 |
China |
4321 |
5000 |
4607.000000 |
India |
1234 |
4321 |
3188.333333 |
Japan |
250 |
250 |
250.000000 |