Contents
Preface
apply: a general "split-apply-combine"
Suppressing the group keys
Quantile and bucket analysis
Pivot tables
Cross-tabulation: crosstab
Example: filling missing values with group-specific values
Example: random sampling and permutation
Example: group weighted average and correlation
Suppose we want to add a column to the DataFrame containing the mean of each index group. One way is to aggregate first and then merge:
>>> k1_means = df.groupby('key1').mean().add_prefix('mean_')
>>> k1_means
mean_data1 mean_data2
key1
a -0.380460 -0.332537
b -0.314586 -0.605574
>>> pd.merge(df,k1_means,left_on='key1',right_index=True)
data1 data2 key1 key2 mean_data1 mean_data2
0 -0.291328 0.257737 a one -0.380460 -0.332537
1 -1.390843 -1.081238 a two -0.380460 -0.332537
4 0.540790 -0.174112 a one -0.380460 -0.332537
2 0.574857 0.202979 b one -0.314586 -0.605574
3 -1.204029 -1.414127 b two -0.314586 -0.605574
This time we use the transform method on the GroupBy object. transform applies a function to each group and then places the results in the appropriate locations:
>>> people = DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'],
...                    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
>>> key = ['one', 'two', 'one', 'two', 'one']
>>> people.groupby(key).mean()
a b c d e
one 0.684081 0.110111 -0.122685 -0.392944 0.676586
two 0.295614 -0.488849 0.111023 -0.452018 -0.593795
>>> people.groupby(key).transform(np.mean)
a b c d e
Joe 0.684081 0.110111 -0.122685 -0.392944 0.676586
Steve 0.295614 -0.488849 0.111023 -0.452018 -0.593795
Wes 0.684081 0.110111 -0.122685 -0.392944 0.676586
Jim 0.295614 -0.488849 0.111023 -0.452018 -0.593795
Travis 0.684081 0.110111 -0.122685 -0.392944 0.676586
Suppose you want to subtract the group mean from each group. To do this, first create a demeaning function and then pass it to transform:
>>> def demean(arr):
... return arr - arr.mean()
...
>>> demeaned = people.groupby(key).transform(demean)
>>> demeaned
a b c d e
Joe -0.779960 0.893851 -1.448675 -0.091887 -0.162785
Steve -0.323736 0.072072 0.659981 -0.131960 -0.498387
Wes 0.305050 -1.817776 0.450697 -0.454107 -0.952844
Jim 0.323736 -0.072072 -0.659981 0.131960 0.498387
Travis 0.474909 0.923925 0.997978 0.545994 1.115629
You can verify that the group means of demeaned are now zero.
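As a quick check, here is a self-contained sketch with made-up data (not the people frame above): after demeaning via transform, every group mean is zero up to floating-point error.

```python
import numpy as np
import pandas as pd

# Small example frame with an explicit group key (hypothetical data).
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=6), "y": rng.normal(size=6)})
key = ["one", "two", "one", "two", "one", "two"]

# Subtract each group's mean from its members.
demeaned = df.groupby(key).transform(lambda g: g - g.mean())

# Every group mean of the demeaned data should be ~0.
check = demeaned.groupby(key).mean()
print(np.allclose(check, 0))  # True
```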
Suppose you want to select the rows with the five highest tip_pct values within each group. First, write a function that selects the rows with the largest values in a given column:
>>> def top(df, n=5, columns='tip_pct'):
...     return df.sort_values(by=columns)[-n:]
>>> top(tips,n=6)
total_bill tip smoker day time size tip_pct
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
232 11.61 3.39 No Sat Dinner 2 0.291990
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
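The same "largest n rows of a column" idea can also be expressed with DataFrame.nlargest, which avoids sorting the whole frame. A minimal sketch on a toy stand-in for the tips data (the values below are made up):

```python
import pandas as pd

# Toy stand-in for the tips data (hypothetical values).
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "tip_pct": [0.10, 0.30, 0.20, 0.05],
})

def top(df, n=2, column="tip_pct"):
    # nlargest returns the n rows with the largest values in `column`.
    return df.nlargest(n, column)

print(top(tips))
#    day  tip_pct
# 1  Sat     0.30
# 2  Sun     0.20
```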
The top function is called on each piece of the DataFrame, and then the results are glued together with pandas.concat:
>>> tips.groupby(['smoker','day']).apply(top)
total_bill tip smoker day time size tip_pct
smoker day
No Fri 99 12.46 1.50 No Fri Dinner 2 0.120385
94 22.75 3.25 No Fri Dinner 2 0.142857
91 22.49 3.50 No Fri Dinner 2 0.155625
223 15.98 3.00 No Fri Lunch 3 0.187735
Sat 228 13.28 2.72 No Sat Dinner 2 0.204819
108 18.24 3.76 No Sat Dinner 2 0.206140
110 14.00 3.00 No Sat Dinner 2 0.214286
20 17.92 4.08 No Sat Dinner 2 0.227679
232 11.61 3.39 No Sat Dinner 2 0.291990
Sun 46 22.23 5.00 No Sun Dinner 2 0.224921
17 16.29 3.71 No Sun Dinner 3 0.227747
6 8.77 2.00 No Sun Dinner 2 0.228050
185 20.69 5.00 No Sun Dinner 5 0.241663
51 10.29 2.60 No Sun Dinner 2 0.252672
Thur 81 16.66 3.40 No Thur Lunch 2 0.204082
139 13.16 2.75 No Thur Lunch 2 0.208967
87 18.28 4.00 No Thur Lunch 2 0.218818
88 24.71 5.85 No Thur Lunch 2 0.236746
149 7.51 2.00 No Thur Lunch 2 0.266312
Yes Fri 226 10.09 2.00 Yes Fri Lunch 2 0.198216
100 11.35 2.50 Yes Fri Dinner 2 0.220264
222 8.58 1.92 Yes Fri Lunch 1 0.223776
221 13.42 3.48 Yes Fri Lunch 2 0.259314
93 16.32 4.30 Yes Fri Dinner 2 0.263480
Sat 171 15.81 3.16 Yes Sat Dinner 2 0.199873
63 18.29 3.76 Yes Sat Dinner 4 0.205577
214 28.17 6.50 Yes Sat Dinner 3 0.230742
109 14.31 4.00 Yes Sat Dinner 2 0.279525
67 3.07 1.00 Yes Sat Dinner 1 0.325733
Sun 174 16.82 4.00 Yes Sun Dinner 2 0.237812
181 23.33 5.65 Yes Sun Dinner 2 0.242177
183 23.17 6.50 Yes Sun Dinner 4 0.280535
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
Thur 204 20.53 4.00 Yes Thur Lunch 4 0.194837
205 16.47 3.23 Yes Thur Lunch 3 0.196114
191 19.81 4.19 Yes Thur Lunch 2 0.211509
200 18.71 4.00 Yes Thur Lunch 3 0.213789
194 16.58 4.00 Yes Thur Lunch 2 0.241255
Passing group_keys=False to groupby suppresses the group keys in the result's index:
>>> tips.groupby('smoker',group_keys=False).apply(top)
total_bill tip smoker day time size tip_pct
88 24.71 5.85 No Thur Lunch 2 0.236746
185 20.69 5.00 No Sun Dinner 5 0.241663
51 10.29 2.60 No Sun Dinner 2 0.252672
149 7.51 2.00 No Thur Lunch 2 0.266312
232 11.61 3.39 No Sat Dinner 2 0.291990
109 14.31 4.00 Yes Sat Dinner 2 0.279525
183 23.17 6.50 Yes Sun Dinner 4 0.280535
67 3.07 1.00 Yes Sat Dinner 1 0.325733
178 9.60 4.00 Yes Sun Dinner 2 0.416667
172 7.25 5.15 Yes Sun Dinner 2 0.710345
For quantile and bucket analysis, cut and qcut slice the data into buckets that groupby can then consume:
>>> frame = DataFrame({'data1':np.random.randn(1000),'data2':np.random.randn(1000)})
>>> factor = pd.cut(frame.data1,4)
>>> factor[:10]
0 (-1.6, -0.026]
1 (-1.6, -0.026]
2 (1.548, 3.123]
3 (-1.6, -0.026]
4 (-0.026, 1.548]
5 (-0.026, 1.548]
6 (-1.6, -0.026]
7 (-0.026, 1.548]
8 (-0.026, 1.548]
9 (-1.6, -0.026]
Name: data1, dtype: category
Categories (4, interval[float64]): [(-3.181, -1.6] < (-1.6, -0.026] < (-0.026, 1.548] <
(1.548, 3.123]]
>>> def get_stats(group):
... return {'min':group.min(),'max':group.max(),'count':group.count(),'mean':group.mean()}
...
>>> grouped = frame.data2.groupby(factor)
>>> grouped.apply(get_stats).unstack()
count max mean min
data1
(-3.181, -1.6] 47.0 1.560586 0.067778 -3.094980
(-1.6, -0.026] 431.0 2.920156 -0.031899 -2.778233
(-0.026, 1.548] 460.0 2.339734 -0.057856 -2.739892
(1.548, 3.123] 62.0 1.728365 -0.143399 -2.449822
>>> grouping = pd.qcut(frame.data1,10,labels=False)
>>> grouped = frame.data2.groupby(grouping)
>>> grouped.apply(get_stats).unstack()
count max mean min
data1
0 100.0 2.248114 0.069002 -3.094980
1 100.0 1.923236 -0.237785 -2.743977
2 100.0 2.920156 0.115480 -2.778233
3 100.0 2.481512 -0.060810 -2.581747
4 100.0 2.793314 0.030760 -2.595131
5 100.0 2.337741 -0.142877 -2.332392
6 100.0 2.339734 -0.046468 -2.589412
7 100.0 2.275533 -0.008744 -2.588843
8 100.0 1.901215 -0.095933 -2.739892
9 100.0 2.229256 -0.083296 -2.449822
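The dict-returning get_stats above works, but the same table can be produced more directly with agg and a list of function names, skipping the apply/unstack round trip. A sketch with fresh random data (the column names mirror the example above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
frame = pd.DataFrame({"data1": rng.normal(size=1000),
                      "data2": rng.normal(size=1000)})
factor = pd.cut(frame["data1"], 4)

# agg with a list of names yields one column per statistic.
stats = frame["data2"].groupby(factor, observed=False).agg(
    ["min", "max", "count", "mean"])
print(stats.shape)  # (4, 4): four buckets, four statistics
```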
Pivot tables are easy to create in Excel; pandas provides pivot_table for the same purpose.
Put sex and smoker in the rows:
>>> tips.pivot_table(index=['sex','smoker'])
size tip tip_pct total_bill
sex smoker
Female No 2.765432 3.211235 0.162538 20.298765
Yes 2.317073 2.965366 0.156436 21.481220
Male No 2.557143 2.738000 0.155614 17.903286
Yes 2.480769 3.042885 0.168526 20.184808
Now aggregate only tip_pct and size, additionally grouping by day, with smoker in the columns:
>>> tips.pivot_table(['tip_pct','size'],index=['sex','day'],columns='smoker')
size tip_pct
smoker No Yes No Yes
sex day
Female Fri 2.250000 2.000000 0.151650 0.174211
Sat 2.826087 2.333333 0.160727 0.148431
Sun 3.000000 2.555556 0.160367 0.153037
Thur 2.555556 2.285714 0.167865 0.163618
Male Fri NaN 2.125000 NaN 0.175284
Sat 2.272727 2.583333 0.155247 0.147513
Sun 2.866667 2.600000 0.159884 0.218042
Thur 2.388889 2.400000 0.148948 0.164035
Passing margins=True adds partial totals in an All row and column:
>>> tips.pivot_table(['tip_pct','size'],index=['sex','day'],columns='smoker',margins=True)
size ... tip_pct
smoker No Yes ... Yes All
sex day ...
Female Fri 2.250000 2.000000 ... 0.174211 0.166007
Sat 2.826087 2.333333 ... 0.148431 0.155329
Sun 3.000000 2.555556 ... 0.153037 0.158535
Thur 2.555556 2.285714 ... 0.163618 0.166990
Male Fri NaN 2.125000 ... 0.175284 0.175284
Sat 2.272727 2.583333 ... 0.147513 0.151212
Sun 2.866667 2.600000 ... 0.218042 0.174423
Thur 2.388889 2.400000 ... 0.164035 0.154336
All 2.668874 2.408602 ... 0.163196 0.160803
[9 rows x 6 columns]
Other aggregation functions can be passed via aggfunc; for example, 'sum' totals the size column:
>>> tips.pivot_table('size',index=['time','sex','smoker'],columns='day',aggfunc='sum',fill_value=0)
day Fri Sat Sun Thur
time sex smoker
Dinner Female No 6 65 81 0
Yes 10 42 23 0
Male No 0 50 86 2
Yes 10 62 26 0
Lunch Female No 3 0 0 69
Yes 4 0 0 16
Male No 0 0 0 41
Yes 7 0 0 24
Combinations that do not occur appear as NaN; here len counts the observations in each group:
>>> tips.pivot_table('tip_pct',index=['sex','smoker'],columns='day',margins=True,aggfunc=len)
day Fri Sat Sun Thur All
sex smoker
Female No 4.0 23.0 27.0 27.0 81.0
Yes 7.0 18.0 9.0 7.0 41.0
Male No NaN 22.0 30.0 18.0 70.0
Yes 8.0 24.0 10.0 10.0 52.0
All 19.0 87.0 76.0 62.0 244.0
>>> data = DataFrame([[1,'F','R'],[2,'M','L'],[3,'F','R'],[4,'M','R'],[5,'M','L'],[6,'M','R'],[7,'F','R'],[8,'F','L'],[9,'M','R'],[10,'F','R']],columns=['Sample','Gender','Handedness'])
>>> data
Sample Gender Handedness
0 1 F R
1 2 M L
2 3 F R
3 4 M R
4 5 M L
5 6 M R
6 7 F R
7 8 F L
8 9 M R
9 10 F R
The crosstab below gives the counts for each combination of Gender and Handedness:
>>> pd.crosstab(data.Gender,data.Handedness,margins=True)
Handedness L R All
Gender
F 1 4 5
M 2 3 5
All 3 7 10
The first two arguments to crosstab can each be a Series, an array, or a list of arrays:
>>> pd.crosstab([tips.time,tips.day],tips.smoker,margins=True)
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244
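Beyond raw counts, crosstab can express the table as proportions via its normalize parameter. A small sketch on made-up data:

```python
import pandas as pd

data = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Handedness": ["R", "L", "R", "R", "L"],
})

# normalize="index" divides each row by its row total,
# giving within-gender proportions.
props = pd.crosstab(data["Gender"], data["Handedness"], normalize="index")
print(props)
```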
Filling NA values with the overall mean:
>>> s = Series(np.random.randn(6))
>>> s[::2] = np.nan
>>> s
0 NaN
1 -1.430336
2 NaN
3 0.937739
4 NaN
5 0.236223
dtype: float64
>>> s.fillna(s.mean())
0 -0.085458
1 -1.430336
2 -0.085458
3 0.937739
4 -0.085458
5 0.236223
dtype: float64
Suppose you need the fill value to vary by group. Simply group the data and use apply with a function that calls fillna on each data chunk:
>>> states = ['Ohio','New York','Vermont','Florida','Oregon','Nevada','California','Idaho']
>>> group_key = ['East']*4 + ['West'] *4
>>> data = Series(np.random.randn(8),index=states)
>>> data[['Vermont','Nevada','Idaho']] = np.nan
>>> data
Ohio -0.734886
New York 1.573174
Vermont NaN
Florida -1.172843
Oregon 0.988466
Nevada NaN
California -1.872393
Idaho NaN
dtype: float64
>>> data.groupby(group_key).mean()
East -0.111518
West -0.441964
dtype: float64
We can fill the NA values using the group means:
>>> fill_mean = lambda g: g.fillna(g.mean())
>>> data.groupby(group_key).apply(fill_mean)
Ohio -0.734886
New York 1.573174
Vermont -0.111518
Florida -1.172843
Oregon 0.988466
Nevada -0.441964
California -1.872393
Idaho -0.441964
dtype: float64
Alternatively, we can predefine fill values for each group in code; the pieces passed to the function have a name attribute set to the group key:
>>> fill_values = {'East':0.5,'West':-1}
>>> fill_func = lambda g: g.fillna(fill_values[g.name])
>>> data.groupby(group_key).apply(fill_func)
Ohio -0.734886
New York 1.573174
Vermont 0.500000
Florida -1.172843
Oregon 0.988466
Nevada -1.000000
California -1.872393
Idaho -1.000000
dtype: float64
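The group-mean fill can also be written without apply: fillna accepts a Series of per-row fill values, and transform('mean') broadcasts each group's mean back to the original index. A sketch using the same states/group_key setup but with deterministic values instead of random data:

```python
import numpy as np
import pandas as pd

states = ["Ohio", "New York", "Vermont", "Florida",
          "Oregon", "Nevada", "California", "Idaho"]
group_key = ["East"] * 4 + ["West"] * 4
data = pd.Series(np.arange(8.0), index=states)
data[["Vermont", "Nevada", "Idaho"]] = np.nan

# transform('mean') returns a same-length Series of group means,
# which fillna then uses value-by-value.
filled = data.fillna(data.groupby(group_key).transform("mean"))
print(filled["Vermont"])  # East mean: (0 + 1 + 3) / 3
```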
An easy way to draw a random sample without replacement is to take the first n elements of np.random.permutation(N), where N is the size of the complete dataset. As an example, here is a deck of playing cards:
>>> suits = ['H','S','C','D']
>>> card_val = (list(range(1, 11)) + [10] * 3) * 4
>>> card_val
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]
>>> base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
>>> base_names
['A', 2, 3, 4, 5, 6, 7, 8, 9, 10, 'J', 'K', 'Q']
>>> cards = []
>>> for suit in suits:
...     cards.extend(str(num) + suit for num in base_names)
...
>>> deck = Series(card_val, index=cards)
>>> deck[:13]
AH 1
2H 2
3H 3
4H 4
5H 5
6H 6
7H 7
8H 8
9H 9
10H 10
JH 10
KH 10
QH 10
dtype: int64
>>> def draw(deck,n=5):
... return deck.take(np.random.permutation(len(deck))[:n])
...
>>> draw(deck)
3H 3
KS 10
QC 10
JS 10
10C 10
dtype: int64
Suppose you want two random cards from each suit. Since the suit is the last character of each card name, we can group on that:
>>> get_suit = lambda card: card[-1]
>>> deck.groupby(get_suit).apply(draw,n=2)
C 9C 9
JC 10
D 4D 4
AD 1
H 4H 4
10H 10
S 3S 3
KS 10
dtype: int64
>>> deck.groupby(get_suit,group_keys=False).apply(draw,n=2)
7C 7
4C 4
AD 1
5D 5
9H 9
4H 4
6S 6
KS 10
dtype: int64
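In pandas 1.1 and later, the per-group draw can be written directly with GroupBy.sample instead of apply; a sketch on a minimal 8-card stand-in for the full deck:

```python
import pandas as pd

# A tiny two-suit deck (hypothetical subset, not the full 52 cards).
deck = pd.Series([1, 2, 10, 10, 1, 2, 10, 10],
                 index=["AH", "2H", "KH", "QH", "AS", "2S", "KS", "QS"])

get_suit = lambda card: card[-1]

# sample(n=...) draws n rows from each group without replacement.
hand = deck.groupby(get_suit).sample(n=2, random_state=0)
print(len(hand))  # 4: two cards from each of the two suits
```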
As an example of group weighted averages, consider a dataset with categories, values, and weights:
>>> df = DataFrame({'category':['a','a','a','a','b','b','b','b'],'data':np.random.randn(8),'weights':np.random.rand(8)})
>>> df
category data weights
0 a -1.493554 0.300840
1 a -2.008278 0.693407
2 a 1.006548 0.736280
3 a -1.226051 0.128157
4 b -0.981050 0.327538
5 b -0.487632 0.201700
6 b -1.262182 0.201121
7 b -0.205049 0.206801
>>> grouped = df.groupby('category')
>>> get_wavg = lambda g: np.average(g['data'],weights = g['weights'])
>>> grouped.apply(get_wavg)
category
a -0.676769
b -0.763949
dtype: float64
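The table of contents also mentions group correlations; the same apply pattern computes a per-group correlation between two columns. A sketch with made-up columns (the column names x and y are assumptions, not from the example above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "category": ["a"] * 50 + ["b"] * 50,
    "x": rng.normal(size=100),
})
# y is built to be correlated with x, so both group correlations are positive.
df["y"] = df["x"] * 0.8 + rng.normal(scale=0.5, size=100)

# Series.corr computes the pairwise correlation inside each group.
corrs = df.groupby("category").apply(lambda g: g["x"].corr(g["y"]))
print(corrs)
```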