The GroupBy Mechanism
Group operations follow the split-apply-combine paradigm: data is split into groups based on one or more keys, a function is applied to each group, and the results are combined into a result object.
import pandas as pd
import numpy as np
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df
  key1 key2     data1     data2
0    a  one  1.067118 -0.237576
1    a  two  0.613814  1.059002
2    b  one  2.682089  0.865306
3    b  two -0.331019 -1.627436
4    a  one -0.599142 -0.615921
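To make the split-apply-combine idea concrete before using groupby, here is a minimal sketch that performs the three phases by hand on this df (illustration only, not the book's code):

pieces = {k: df[df['key1'] == k] for k in ['a', 'b']}          # split
applied = {k: p['data1'].mean() for k, p in pieces.items()}    # apply
pd.Series(applied)   # combine; same as df['data1'].groupby(df['key1']).mean()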
To group by key1 and compute the mean of the data1 column, access data1 and call groupby with the key1 column:
grouped=df['data1'].groupby(df['key1'])
grouped
The grouped variable is a GroupBy object. It has not actually computed anything yet; it simply holds some intermediate data about the group key df['key1']. In other words, the object already has all the information needed to apply an operation to each group:
grouped.mean()
key1
a 0.360597
b 1.175535
Name: data1, dtype: float64
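Because the GroupBy object is lazy, you can also inspect how the rows were split before running any computation; ngroups and groups are standard GroupBy attributes:

grouped.ngroups   # 2
grouped.groups    # mapping of group label -> row labels, e.g. {'a': [0, 1, 4], 'b': [2, 3]}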
means=df['data1'].groupby([df['key1'],df['key2']]).mean()
means
key1 key2
a one 0.233988
two 0.613814
b one 2.682089
two -0.331019
Name: data1, dtype: float64
means.unstack()
key2       one       two
key1
a     0.233988  0.613814
b     2.682089 -0.331019
Group keys can be any arrays of the right length:
states=np.array(['Ohio','California','California','Ohio','Ohio'])
years=np.array([2005,2005,2006,2005,2006])
df['data1'].groupby([states,years]).mean()
California 2005 0.613814
2006 2.682089
Ohio 2005 0.368050
2006 -0.599142
Name: data1, dtype: float64
Frequently, the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether strings, numbers, or other Python objects) as the group keys:
df.groupby('key1')
df.groupby(['key1','key2']).mean()
              data1     data2
key1 key2
a    one   0.233988 -0.426748
     two   0.613814  1.059002
b    one   2.682089  0.865306
     two  -0.331019 -1.627436
df.groupby(['key1','key2']).size()
key1 key2
a one 2
two 1
b one 1
two 1
dtype: int64
Iterating over groups
The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
a
key1 key2 data1 data2
0 a one 1.067118 -0.237576
1 a two 0.613814 1.059002
4 a one -0.599142 -0.615921
b
key1 key2 data1 data2
2 b one 2.682089 0.865306
3 b two -0.331019 -1.627436
In the case of multiple keys, the first element of the tuple will be a tuple of key values:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)
('a', 'one')
key1 key2 data1 data2
0 a one 1.067118 -0.237576
4 a one -0.599142 -0.615921
('a', 'two')
key1 key2 data1 data2
1 a two 0.613814 1.059002
('b', 'one')
key1 key2 data1 data2
2 b one 2.682089 0.865306
('b', 'two')
key1 key2 data1 data2
3 b two -0.331019 -1.627436
You can do whatever you want with these pieces of data. One useful recipe is to build a dict of the pieces:
pieces=dict(list(df.groupby('key1')))
pieces
{'a': key1 key2 data1 data2
0 a one 1.067118 -0.237576
1 a two 0.613814 1.059002
4 a one -0.599142 -0.615921,
'b': key1 key2 data1 data2
2 b one 2.682089 0.865306
3 b two -0.331019 -1.627436}
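If you only need one of the pieces, the standard get_group method retrieves it directly, without materializing the whole dict:

df.groupby('key1').get_group('b')   # same as pieces['b']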
By default, groupby groups on axis=0, but you can group on any of the other axes. For example, here we group the columns of df by dtype:
df.dtypes
key1 object
key2 object
data1 float64
data2 float64
dtype: object
grouped=df.groupby(df.dtypes,axis=1)
grouped
for dtype, group in grouped:
    print(dtype)
    print(group)
float64
data1 data2
0 1.067118 -0.237576
1 0.613814 1.059002
2 2.682089 0.865306
3 -0.331019 -1.627436
4 -0.599142 -0.615921
object
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one
Selecting a column or subset of columns
Indexing a GroupBy object created from a DataFrame with a column name (a single string) or array of column names selects those columns for aggregation:
df.groupby('key1')['data1']
df.groupby('key1')[['data2']]
df['data1'].groupby(df['key1'])
df[['data2']].groupby(df['key1'])
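The first two expressions are syntactic sugar for the latter two; a quick equivalence check (a sketch):

df.groupby('key1')['data1'].mean().equals(df['data1'].groupby(df['key1']).mean())   # True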
Especially with large datasets, it may be desirable to aggregate only a few columns:
df.groupby(['key1','key2'])[['data2']].mean()
              data2
key1 key2
a    one  -0.426748
     two   1.059002
b    one   0.865306
     two  -1.627436
The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if only a single column name is passed as a scalar:
s_grouped=df.groupby(['key1','key2'])['data2']
s_grouped
s_grouped.mean()
key1 key2
a one -0.426748
two 1.059002
b one 0.865306
two -1.627436
Name: data2, dtype: float64
Grouping with dicts and Series
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=['a', 'b', 'c', 'd', 'e'],
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3,[1,2]]=np.nan
people
               a         b         c         d         e
Joe     1.064395  0.458101 -0.647722  0.913158  0.266544
Steve   0.464769 -0.152126  0.692518  0.800796  0.020842
Wes     0.534295       NaN       NaN  1.031494  1.019269
Jim     0.485397  0.930118 -0.124346  0.447506  0.761696
Travis -0.108946  2.500092  1.173428  0.467239  1.579619
Suppose we have a group correspondence for the columns and want to sum the columns by group:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
           'd': 'blue', 'e': 'red', 'f': 'orange'}
You could construct an array from this dict to pass to groupby, but instead we can just pass the dict (the unused grouping key 'f' is harmless):
by_column=people.groupby(mapping,axis=1)
by_column.sum()
            blue       red
Joe     0.265436  1.789040
Steve   1.493315  0.333485
Wes     1.031494  1.553564
Jim     0.323159  2.177211
Travis  1.640667  3.970764
map_series=pd.Series(mapping)
map_series
a red
b red
c blue
d blue
e red
f orange
dtype: object
people.groupby(map_series,axis=1).count()
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
Grouping with functions
Any function passed as a group key will be called once per index value, and its return values are used as the group names. Here, grouping by len groups the people DataFrame by the length of the names:
people.groupby(len).sum()
          a         b         c         d         e
3  2.084087  1.388219 -0.772069  2.392158  2.047509
5  0.464769 -0.152126  0.692518  0.800796  0.020842
6 -0.108946  2.500092  1.173428  0.467239  1.579619
Mixing functions with arrays, lists, dicts, or Series is not a problem, as everything gets converted to arrays internally:
key_list=['one','one','one','two','two']
people.groupby([len,key_list]).min()
              a         b         c         d         e
3 one  0.534295  0.458101 -0.647722  0.913158  0.266544
  two  0.485397  0.930118 -0.124346  0.447506  0.761696
5 one  0.464769 -0.152126  0.692518  0.800796  0.020842
6 two -0.108946  2.500092  1.173428  0.467239  1.579619
Grouping by index levels
A convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['city', 'tenor'])
hier_df=pd.DataFrame(np.random.randn(4,5),columns=columns)
hier_df
city          US                            JP
tenor          1         3         5         1         3
0      -2.712569  0.036956 -0.643195  0.770231 -1.886666
1      -0.841508  1.772366  0.550501  0.237302  0.968424
2      -1.221179 -0.434443 -0.475438  0.666095 -0.083710
3      -0.519055  2.062299  1.077549 -1.361656 -1.274750
To group by level, pass the level number or name using the level keyword:
hier_df.groupby(level='city',axis=1).count()
city  JP  US
0      2   3
1      2   3
2      2   3
3      2   3
Data Aggregation
Aggregation refers to any data transformation that produces scalar values from arrays. Many common aggregations have optimized groupby implementations:
| Function | Description |
| --- | --- |
| count | Number of non-NA values in the group |
| sum | Sum of non-NA values |
| mean | Mean of non-NA values |
| median | Arithmetic median of non-NA values |
| std, var | Unbiased (n - 1 denominator) standard deviation and variance |
| min, max | Minimum and maximum of non-NA values |
| prod | Product of non-NA values |
| first, last | First and last non-NA values |
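Any of the names in this table can be passed as strings to agg; for example, applying several of them at once to the df defined earlier (a sketch):

df.groupby('key1')['data1'].agg(['count', 'sum', 'mean', 'min', 'max'])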
You are not limited to these methods. For example, quantile computes sample quantiles of a Series or of a DataFrame's columns; although quantile is not implemented specifically for GroupBy, it is a Series method and is thus available here:
df
  key1 key2     data1     data2
0    a  one  1.067118 -0.237576
1    a  two  0.613814  1.059002
2    b  one  2.682089  0.865306
3    b  two -0.331019 -1.627436
4    a  one -0.599142 -0.615921
grouped=df.groupby(['key1','key2'])
grouped['data1'].quantile(0.9)
key1 key2
a one 0.900492
two 0.613814
b one 2.682089
two -0.331019
Name: data1, dtype: float64
To use your own aggregation functions, pass any function that aggregates an array to the aggregate or agg method:
def peak_to_peak(arr):
    return arr.max() - arr.min()
grouped.agg(peak_to_peak)
             data1     data2
key1 key2
a    one   1.66626  0.378344
     two   0.00000  0.000000
b    one   0.00000  0.000000
     two   0.00000  0.000000
grouped.describe()
          data1
          count      mean       std       min       25%       50%       75%       max
key1 key2
a    one    2.0  0.233988  1.178224 -0.599142 -0.182577  0.233988  0.650553  1.067118
     two    1.0  0.613814       NaN  0.613814  0.613814  0.613814  0.613814  0.613814
b    one    1.0  2.682089       NaN  2.682089  2.682089  2.682089  2.682089  2.682089
     two    1.0 -0.331019       NaN -0.331019 -0.331019 -0.331019 -0.331019 -0.331019

          data2
          count      mean      std       min       25%       50%       75%       max
key1 key2
a    one    2.0 -0.426748  0.26753 -0.615921 -0.521335 -0.426748 -0.332162 -0.237576
     two    1.0  1.059002      NaN  1.059002  1.059002  1.059002  1.059002  1.059002
b    one    1.0  0.865306      NaN  0.865306  0.865306  0.865306  0.865306  0.865306
     two    1.0 -1.627436      NaN -1.627436 -1.627436 -1.627436 -1.627436 -1.627436
Column-wise and multiple function application
Returning to the tipping dataset, load it with read_csv:
tips=pd.read_csv("F:/项目学习/利用Pyhon进行数据分析(第二版)/利用Pyhon进行数据分析/pydata-book-2nd-edition/examples/tips.csv")
tips.head()
   total_bill   tip smoker  day    time  size
0       16.99  1.01     No  Sun  Dinner     2
1       10.34  1.66     No  Sun  Dinner     3
2       21.01  3.50     No  Sun  Dinner     3
3       23.68  3.31     No  Sun  Dinner     2
4       24.59  3.61     No  Sun  Dinner     4
tips['tip_pct'] = tips['tip'] / tips['total_bill']   # add tip percentage of total bill
tips.head()
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808
grouped=tips.groupby(['day','smoker'])
grouped_pct=grouped['tip_pct']
grouped_pct.agg('mean')
day smoker
Fri No 0.151650
Yes 0.174783
Sat No 0.158048
Yes 0.147906
Sun No 0.160113
Yes 0.187250
Thur No 0.160298
Yes 0.163863
Name: tip_pct, dtype: float64
If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:
grouped_pct.agg(['mean','std',peak_to_peak])
                 mean       std  peak_to_peak
day  smoker
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240
If you pass a list of (name, function) tuples, the first element of each tuple will be used as the DataFrame column names:
grouped_pct.agg([('foo','mean'),('bar',np.std)])
                  foo       bar
day  smoker
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
With a DataFrame, you can specify a list of functions to apply to all of the columns, or different functions per column:
functions=['count','mean','max']
result=grouped[['tip_pct','total_bill']].agg(functions)
result
            tip_pct                     total_bill
              count      mean       max      count       mean    max
day  smoker
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11
result['tip_pct']
             count      mean       max
day  smoker
Fri  No          4  0.151650  0.187735
     Yes        15  0.174783  0.263480
Sat  No         45  0.158048  0.291990
     Yes        42  0.147906  0.325733
Sun  No         57  0.160113  0.252672
     Yes        19  0.187250  0.710345
Thur No         45  0.160298  0.266312
     Yes        17  0.163863  0.241255
ftuples = [('Durchschnitt', 'mean'),('Abweichung', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)
                  tip_pct               total_bill
             Durchschnitt Abweichung  Durchschnitt  Abweichung
day  smoker
Fri  No          0.151650   0.000791     18.420000   25.596333
     Yes         0.174783   0.002631     16.813333   82.562438
Sat  No          0.158048   0.001581     19.661778   79.908965
     Yes         0.147906   0.003767     21.276667  101.387535
Sun  No          0.160113   0.001793     20.506667   66.099980
     Yes         0.187250   0.023757     24.120000  109.046044
Thur No          0.160298   0.001503     17.113111   59.625081
     Yes         0.163863   0.001551     19.190588   69.808518
To apply different functions to one or more of the columns, pass a dict to agg that maps column names to functions:
grouped.agg({'tip':np.max,'size':'sum'})
               tip  size
day  smoker
Fri  No       3.50     9
     Yes      4.73    31
Sat  No       9.00   115
     Yes     10.00   104
Sun  No       6.00   167
     Yes      6.50    49
Thur No       6.70   112
     Yes      5.00    40
A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column:
grouped.agg({'tip_pct':['min','max','mean','std'],'size':'sum'})
              tip_pct                               size
                  min       max      mean       std  sum
day  smoker
Fri  No      0.120385  0.187735  0.151650  0.028123    9
     Yes     0.103555  0.263480  0.174783  0.051293   31
Sat  No      0.056797  0.291990  0.158048  0.039767  115
     Yes     0.035638  0.325733  0.147906  0.061375  104
Sun  No      0.059447  0.252672  0.160113  0.042347  167
     Yes     0.065660  0.710345  0.187250  0.154134   49
Thur No      0.072961  0.266312  0.160298  0.038774  112
     Yes     0.090014  0.241255  0.163863  0.039389   40
Returning aggregated data without row indexes
In all of the examples so far, the aggregated data comes back with an index (potentially hierarchical) composed of the unique group key combinations. You can disable this by passing as_index=False to groupby. Because mean cannot be applied to non-numeric columns, we first inspect the dtypes and set aside the numeric columns:
tips.info()
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_bill 244 non-null float64
1 tip 244 non-null float64
2 smoker 244 non-null object
3 day 244 non-null object
4 time 244 non-null object
5 size 244 non-null int64
6 tip_pct 244 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 13.5+ KB
num_col=tips.select_dtypes(include=['int64','float64'])
num_col
     total_bill   tip  size   tip_pct
0         16.99  1.01     2  0.059447
1         10.34  1.66     3  0.160542
2         21.01  3.50     3  0.166587
3         23.68  3.31     2  0.139780
4         24.59  3.61     4  0.146808
..          ...   ...   ...       ...
239       29.03  5.92     3  0.203927
240       27.18  2.00     2  0.073584
241       22.67  2.00     2  0.088222
242       17.82  1.75     2  0.098204
243       18.78  3.00     2  0.159744

[244 rows x 4 columns]
tips
     total_bill   tip smoker   day    time  size   tip_pct
0         16.99  1.01     No   Sun  Dinner     2  0.059447
1         10.34  1.66     No   Sun  Dinner     3  0.160542
2         21.01  3.50     No   Sun  Dinner     3  0.166587
3         23.68  3.31     No   Sun  Dinner     2  0.139780
4         24.59  3.61     No   Sun  Dinner     4  0.146808
..          ...   ...    ...   ...     ...   ...       ...
239       29.03  5.92     No   Sat  Dinner     3  0.203927
240       27.18  2.00    Yes   Sat  Dinner     2  0.073584
241       22.67  2.00    Yes   Sat  Dinner     2  0.088222
242       17.82  1.75     No   Sat  Dinner     2  0.098204
243       18.78  3.00     No  Thur  Dinner     2  0.159744

[244 rows x 7 columns]
tips_1 = tips.drop('time', axis=1)   # drop the only non-numeric column that is not a group key
tips_1.groupby(['day','smoker']).mean()
             total_bill       tip      size   tip_pct
day  smoker
Fri  No       18.420000  2.812500  2.250000  0.151650
     Yes      16.813333  2.714000  2.066667  0.174783
Sat  No       19.661778  3.102889  2.555556  0.158048
     Yes      21.276667  2.875476  2.476190  0.147906
Sun  No       20.506667  3.167895  2.929825  0.160113
     Yes      24.120000  3.516842  2.578947  0.187250
Thur No       17.113111  2.673778  2.488889  0.160298
     Yes      19.190588  3.030000  2.352941  0.163863
tips_1.groupby(['day','smoker'],as_index=False).mean()
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
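The same result can be obtained by calling reset_index on the indexed result, though using as_index=False avoids building the hierarchical index in the first place:

tips_1.groupby(['day', 'smoker']).mean().reset_index()   # identical table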
apply: General split-apply-combine
The most general-purpose GroupBy method is apply: it splits the object into pieces, invokes the passed function on each piece, and then concatenates the pieces together. Consider a function that selects the rows with the largest values in a particular column:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]
top(tips,n=6)
     total_bill   tip smoker  day    time  size   tip_pct
109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345
tips.groupby('smoker').apply(top)
            total_bill   tip smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85     No  Thur   Lunch     2  0.236746
       185       20.69  5.00     No   Sun  Dinner     5  0.241663
       51        10.29  2.60     No   Sun  Dinner     2  0.252672
       149        7.51  2.00     No  Thur   Lunch     2  0.266312
       232       11.61  3.39     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
tips.groupby(['smoker','day']).apply(top,n=1,column='total_bill')
                 total_bill    tip smoker   day    time  size   tip_pct
smoker day
No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.142857
       Sat  212       48.33   9.00     No   Sat  Dinner     4  0.186220
       Sun  156       48.17   5.00     No   Sun  Dinner     6  0.103799
       Thur 142       41.19   5.00     No  Thur   Lunch     5  0.121389
Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.117750
       Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.196812
       Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.077178
       Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.115982
result=tips.groupby('smoker')['tip_pct'].describe()
result
        count      mean       std       min       25%       50%       75%       max
smoker
No      151.0  0.159328  0.039910  0.056797  0.136906  0.155625  0.185014  0.291990
Yes      93.0  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059  0.710345
result.unstack('smoker')
smoker
count No 151.000000
Yes 93.000000
mean No 0.159328
Yes 0.163196
std No 0.039910
Yes 0.085119
min No 0.056797
Yes 0.035638
25% No 0.136906
Yes 0.106771
50% No 0.155625
Yes 0.153846
75% No 0.185014
Yes 0.195059
max No 0.291990
Yes 0.710345
dtype: float64
Inside GroupBy, when you invoke a method like describe, it is actually just a shortcut for:
f = lambda x: x.describe()
grouped.apply(f)
Suppressing the group keys
In the preceding examples, the resulting object has a hierarchical index formed from the group keys along with the indexes of each piece of the original object. You can disable this by passing group_keys=False to groupby:
tips.groupby('smoker',group_keys=False).apply(top)
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.236746
185       20.69  5.00     No   Sun  Dinner     5  0.241663
51        10.29  2.60     No   Sun  Dinner     2  0.252672
149        7.51  2.00     No  Thur   Lunch     2  0.266312
232       11.61  3.39     No   Sat  Dinner     2  0.291990
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
Quantile and bucket analysis
pandas tools like cut and qcut slice data into buckets; combined with groupby, they make it easy to perform bucket or quantile analysis on a dataset. Here we cut some random data into equal-length quartiles:
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
quartiles[:10]
0 (-0.0596, 1.69]
1 (-3.565, -1.809]
2 (-1.809, -0.0596]
3 (-0.0596, 1.69]
4 (-0.0596, 1.69]
5 (-0.0596, 1.69]
6 (-1.809, -0.0596]
7 (-1.809, -0.0596]
8 (-1.809, -0.0596]
9 (-1.809, -0.0596]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-3.565, -1.809] < (-1.809, -0.0596] < (-0.0596, 1.69] < (1.69, 3.439]]
def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}
grouped=frame.data2.groupby(quartiles)
grouped.apply(get_stats).unstack()
                         min       max  count      mean
data1
(-3.565, -1.809]   -2.240036  2.383580   40.0 -0.147557
(-1.809, -0.0596]  -2.455718  3.207027  452.0 -0.016250
(-0.0596, 1.69]    -2.989626  3.232200  459.0  0.048823
(1.69, 3.439]      -1.774519  2.201069   49.0  0.054332
These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use qcut. Passing labels=False gets just the quantile numbers:
grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
            min       max  count      mean
data1
0     -2.240036  2.617873  100.0 -0.036513
1     -2.162690  3.207027  100.0  0.038019
2     -2.312058  1.901076  100.0 -0.052351
3     -2.297201  2.963489  100.0 -0.033958
4     -2.455718  3.178088  100.0 -0.083398
5     -2.989626  2.336623  100.0  0.110641
6     -2.269297  3.232200  100.0  0.074222
7     -2.115640  2.599014  100.0  0.070273
8     -2.387508  2.364138  100.0 -0.062957
9     -1.774519  2.333942  100.0  0.094271
Example: Filling missing values with group-specific values
When cleaning up missing data, sometimes you will remove observations with dropna, but in other cases you may want to fill the NA values with a fixed value or with a value derived from the data itself. fillna is the right tool; for example, here we fill NA values with the mean:
s = pd.Series(np.random.randn(6))
s
0 1.264682
1 0.495235
2 -0.016187
3 1.781491
4 -0.231563
5 -0.393924
dtype: float64
s[::2] = np.nan
s
0 NaN
1 0.495235
2 NaN
3 1.781491
4 NaN
5 -0.393924
dtype: float64
s.fillna(s.mean())
0 0.627601
1 0.495235
2 0.627601
3 1.781491
4 0.627601
5 -0.393924
dtype: float64
Suppose instead that the fill value needs to vary by group. One way to do this is to group the data and use apply with a function that calls fillna on each data chunk:
states = ['Ohio', 'New York', 'Vermont', 'Florida',
          'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data
Ohio 1.045855
New York 1.075995
Vermont 0.425475
Florida 0.086684
Oregon -1.262191
Nevada -0.209671
California 0.120289
Idaho -0.564744
dtype: float64
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data
Ohio 1.045855
New York 1.075995
Vermont NaN
Florida 0.086684
Oregon -1.262191
Nevada NaN
California 0.120289
Idaho NaN
dtype: float64
data.groupby(group_key).mean()
East 0.736178
West -0.570951
dtype: float64
We can fill the NA values using the group means:
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
East Ohio 1.045855
New York 1.075995
Vermont 0.736178
Florida 0.086684
West Oregon -1.262191
Nevada -0.570951
California 0.120289
Idaho -0.570951
dtype: float64
Alternatively, you can have predefined fill values in your code that vary by group:
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
East Ohio 1.045855
New York 1.075995
Vermont 0.500000
Florida 0.086684
West Oregon -1.262191
Nevada -1.000000
California 0.120289
Idaho -1.000000
dtype: float64
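This works because, inside apply, each chunk carries its group key as a .name attribute; a quick way to see the labels (a sketch):

data.groupby(group_key).apply(lambda g: g.name)
# East    East
# West    West
# dtype: object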
Example: Random sampling and permutation
Suppose you wanted to draw a random sample, with or without replacement, from a large dataset for Monte Carlo simulation or some other purpose. There are a number of ways to perform the draws; here we use the sample method for Series. To demonstrate, here is a way to construct a deck of English-style playing cards:
suits = ['H', 'S', 'C', 'D']  # Hearts, Spades, Clubs, Diamonds
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
    cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)
deck
AH 1
2H 2
3H 3
4H 4
5H 5
6H 6
7H 7
8H 8
9H 9
10H 10
JH 10
KH 10
QH 10
AS 1
2S 2
3S 3
4S 4
5S 5
6S 6
7S 7
8S 8
9S 9
10S 10
JS 10
KS 10
QS 10
AC 1
2C 2
3C 3
4C 4
5C 5
6C 6
7C 7
8C 8
9C 9
10C 10
JC 10
KC 10
QC 10
AD 1
2D 2
3D 3
4D 4
5D 5
6D 6
7D 7
8D 8
9D 9
10D 10
JD 10
KD 10
QD 10
dtype: int64
Drawing a hand of five cards from the deck:
def draw(deck, n=5):
    return deck.sample(n)
draw(deck)
10H 10
9H 9
6H 6
JC 10
QH 10
dtype: int64
Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, we can group on that and use apply:
get_suit=lambda card:card[-1]
deck.groupby(get_suit).apply(draw,n=2)
C 7C 7
KC 10
D 4D 4
AD 1
H 3H 3
8H 8
S KS 10
JS 10
dtype: int64
deck.groupby(get_suit,group_keys=False).apply(draw,n=2)
5C 5
4C 4
6D 6
4D 4
2H 2
5H 5
8S 8
3S 3
dtype: int64
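Note that recent pandas versions (1.1 and later) also provide a GroupBy.sample method, which achieves the same thing without apply:

deck.groupby(get_suit).sample(n=2)   # two random cards per suit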
Example: Group weighted average and correlation
Under the split-apply-combine paradigm, operations between columns in a DataFrame or between two Series, such as a group weighted average, are straightforward. Take this dataset containing group keys, values, and some weights:
df = pd.DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'data': np.random.randn(8),
                   'weights': np.random.rand(8)})
df
  category      data   weights
0        a -1.022846  0.702148
1        a  0.405966  0.095783
2        a  0.282171  0.439928
3        a  0.541287  0.866943
4        b  0.695519  0.363663
5        b -0.419917  0.270279
6        b -0.077227  0.565714
7        b  1.600511  0.636178
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)
category
a -0.040814
b 0.606788
dtype: float64
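As a sanity check, the weighted average of one group can be computed by hand and compared with the grouped result (a sketch):

g = df[df['category'] == 'a']
np.average(g['data'], weights=g['weights'])   # matches grouped.apply(get_wavg)['a']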
As a less trivial example, consider a dataset from Yahoo! Finance containing end-of-day closing prices for a few stocks and the S&P 500 index (the SPX symbol):
close_px=pd.read_csv('F:/项目学习/利用Pyhon进行数据分析(第二版)/利用Pyhon进行数据分析/pydata-book-2nd-edition/examples/stock_px_2.csv',parse_dates=True,index_col=0)
close_px.head()
            AAPL   MSFT    XOM     SPX
2003-01-02  7.40  21.11  29.22  909.03
2003-01-03  7.45  21.14  29.24  908.59
2003-01-06  7.45  21.52  29.96  929.01
2003-01-07  7.43  21.93  28.95  922.93
2003-01-08  7.28  21.31  28.83  909.93
close_px.info()
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AAPL 2214 non-null float64
1 MSFT 2214 non-null float64
2 XOM 2214 non-null float64
3 SPX 2214 non-null float64
dtypes: float64(4)
memory usage: 86.5 KB
close_px[-4:]
              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58
# pairwise correlation of each column with the 'SPX' column
spx_corr = lambda x: x.corrwith(x['SPX'])
# daily percent change of the closing prices
rets = close_px.pct_change().dropna()
rets
                AAPL      MSFT       XOM       SPX
2003-01-03  0.006757  0.001421  0.000684 -0.000484
2003-01-06  0.000000  0.017975  0.024624  0.022474
2003-01-07 -0.002685  0.019052 -0.033712 -0.006545
2003-01-08 -0.020188 -0.028272 -0.004145 -0.014086
2003-01-09  0.008242  0.029094  0.021159  0.019386
...              ...       ...       ...       ...
2011-10-10  0.051406  0.026286  0.036977  0.034125
2011-10-11  0.029526  0.002227 -0.000131  0.000544
2011-10-12  0.004747 -0.001481  0.011669  0.009795
2011-10-13  0.015515  0.008160 -0.010238 -0.002974
2011-10-14  0.033225  0.003311  0.022784  0.017380

[2213 rows x 4 columns]
get_year = lambda x: x.year   # extract the year from each datetime index label
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)
          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0
You could also compute inter-column correlations; here, the annual correlation between Apple and Microsoft:
by_year.apply(lambda g:g['AAPL'].corr(g['MSFT']))
2003 0.480868
2004 0.259024
2005 0.300093
2006 0.161735
2007 0.417738
2008 0.611901
2009 0.432738
2010 0.571946
2011 0.581987
dtype: float64
Example: Group-wise linear regression
You can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value. For example, using the statsmodels econometrics library, we can define the following regress function, which executes an ordinary least squares (OLS) regression on each chunk of data:
import statsmodels.api as sm
def regress(data, yvar, xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
Now, to run a yearly regression of AAPL returns on SPX returns:
by_year.apply(regress, 'AAPL', ['SPX'])
           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514
Pivot Tables and Cross-Tabulation
Pivot tables
A pivot table aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. DataFrame has a pivot_table method whose default aggregation is mean. Using tips_1 (the tips data minus the non-numeric time column), the group means by day and smoker are:
tips_1.pivot_table(index=['day','smoker'])
                 size       tip   tip_pct  total_bill
day  smoker
Fri  No      2.250000  2.812500  0.151650   18.420000
     Yes     2.066667  2.714000  0.174783   16.813333
Sat  No      2.555556  3.102889  0.158048   19.661778
     Yes     2.476190  2.875476  0.147906   21.276667
Sun  No      2.929825  3.167895  0.160113   20.506667
     Yes     2.578947  3.516842  0.187250   24.120000
Thur No      2.488889  2.673778  0.160298   17.113111
     Yes     2.352941  3.030000  0.163863   19.190588
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker')
                 size             tip_pct
smoker             No       Yes        No       Yes
time   day
Dinner Fri   2.000000  2.222222  0.139622  0.165347
       Sat   2.555556  2.476190  0.158048  0.147906
       Sun   2.929825  2.578947  0.160113  0.187250
       Thur  2.000000       NaN  0.159744       NaN
Lunch  Fri   3.000000  1.833333  0.187735  0.188937
       Thur  2.500000  2.352941  0.160311  0.163863
Passing margins=True adds partial totals. This adds All row and column labels, with corresponding values that are the group statistics for all of the data within a single tier:
tips.pivot_table(['tip_pct','size'],index=['time','day'],columns='smoker',margins=True)
                 size                       tip_pct
smoker             No       Yes       All        No       Yes       All
time   day
Dinner Fri   2.000000  2.222222  2.166667  0.139622  0.165347  0.158916
       Sat   2.555556  2.476190  2.517241  0.158048  0.147906  0.153152
       Sun   2.929825  2.578947  2.842105  0.160113  0.187250  0.166897
       Thur  2.000000       NaN  2.000000  0.159744       NaN  0.159744
Lunch  Fri   3.000000  1.833333  2.000000  0.187735  0.188937  0.188765
       Thur  2.500000  2.352941  2.459016  0.160311  0.163863  0.161301
All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803
Here, the All values are means that do not distinguish smoker from non-smoker (the All columns) and do not distinguish any single item within the two levels of row grouping (the All row).
To use an aggregation function other than mean, pass it to aggfunc. For example, 'count' or len will give you a cross-tabulation (count or frequency) of group sizes:
tips.pivot_table('tip_pct', index=['time', 'smoker'],columns='day',aggfunc=len, margins=True)
day             Fri   Sat   Sun  Thur  All
time   smoker
Dinner No       3.0  45.0  57.0   1.0  106
       Yes      9.0  42.0  19.0   NaN   70
Lunch  No       1.0   NaN   NaN  44.0   45
       Yes      6.0   NaN   NaN  17.0   23
All            19.0  87.0  76.0  62.0  244
tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],columns='day', aggfunc='mean', fill_value=0)
day                     Fri       Sat       Sun      Thur
time   size smoker
Dinner 1    No     0.000000  0.137931  0.000000  0.000000
            Yes    0.000000  0.325733  0.000000  0.000000
       2    No     0.139622  0.162705  0.168859  0.159744
            Yes    0.171297  0.148668  0.207893  0.000000
       3    No     0.000000  0.154661  0.152663  0.000000
            Yes    0.000000  0.144995  0.152660  0.000000
       4    No     0.000000  0.150096  0.148143  0.000000
            Yes    0.117750  0.124515  0.193370  0.000000
       5    No     0.000000  0.000000  0.206928  0.000000
            Yes    0.000000  0.106572  0.065660  0.000000
       6    No     0.000000  0.000000  0.103799  0.000000
Lunch  1    No     0.000000  0.000000  0.000000  0.181728
            Yes    0.223776  0.000000  0.000000  0.000000
       2    No     0.000000  0.000000  0.000000  0.166005
            Yes    0.181969  0.000000  0.000000  0.158843
       3    No     0.187735  0.000000  0.000000  0.084246
            Yes    0.000000  0.000000  0.000000  0.204952
       4    No     0.000000  0.000000  0.000000  0.138919
            Yes    0.000000  0.000000  0.000000  0.155410
       5    No     0.000000  0.000000  0.000000  0.121389
       6    No     0.000000  0.000000  0.000000  0.173706
pivot_table options
| Argument | Description |
| --- | --- |
| values | Column name or names to aggregate; by default aggregates all numeric columns |
| index | Column names or other group keys to group on the rows of the resulting pivot table |
| columns | Column names or other group keys to group on the columns of the resulting pivot table |
| aggfunc | Aggregation function or list of functions ('mean' by default); can be any function valid in a groupby context |
| fill_value | Replace missing values in the result table |
| dropna | If True, do not include columns whose entries are all NA |
| margins | Add row/column subtotals and grand total (False by default) |
Cross-tabulations: crosstab
A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies. Consider the following handedness survey data:
data = pd.DataFrame({'Sample': np.arange(1, 11),
                     'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan',
                                     'Japan', 'USA', 'USA', 'Japan', 'USA'],
                     'Handedness': ['Right-handed', 'Left-handed', 'Right-handed',
                                    'Right-handed', 'Left-handed', 'Right-handed',
                                    'Right-handed', 'Left-handed', 'Right-handed',
                                    'Right-handed']})
data
   Sample Nationality    Handedness
0       1         USA  Right-handed
1       2       Japan   Left-handed
2       3         USA  Right-handed
3       4       Japan  Right-handed
4       5       Japan   Left-handed
5       6       Japan  Right-handed
6       7         USA  Right-handed
7       8         USA   Left-handed
8       9       Japan  Right-handed
9      10         USA  Right-handed
pd.crosstab(data.Nationality, data.Handedness, margins=True)
Handedness   Left-handed  Right-handed  All
Nationality
Japan                  2             3    5
USA                    1             4    5
All                    3             7   10
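crosstab is essentially a pivot table that counts; the same table can be produced with pivot_table by counting any always-present column (here the Sample column; a sketch):

data.pivot_table('Sample', index='Nationality', columns='Handedness', aggfunc='count', fill_value=0, margins=True)

The first two arguments to crosstab can each be an array, Series, or list of arrays, so you can also cross-tabulate multiple keys: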
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244