Data Aggregation and Group Operations
- 1. GroupBy Mechanics: Split-Apply-Combine
- 1.1 Iterating over Groups
- 1.2 Selecting a Column or Subset of Columns
- 1.3 Grouping with Dicts and Series
- 1.4 Grouping with Functions
- 1.5 Grouping by Index Levels
- 2. Data Aggregation
- 2.1 Column-Wise and Multiple Function Application
- 2.2 Returning Aggregated Data Without Row Indexes
- 3. Apply: General Split-Apply-Combine
- 3.1 Suppressing the Group Keys
- 3.2 Quantile and Bucket Analysis
- 3.3 Example: Filling Missing Values with Group-Specific Values
- 3.4 Example: Random Sampling and Permutation
- 3.5 Example: Group Weighted Average and Correlation
- 3.6 Example: Group-Wise Linear Regression
- 4. Pivot Tables and Cross-Tabulation
- 4.1 Pivot Tables
- 4.2 Cross-Tabulations: crosstab
1. GroupBy Mechanics: Split-Apply-Combine
- The data in a pandas object (Series, DataFrame, or otherwise) is split into groups based on one or more keys along a particular axis; a function is then applied to each group, producing a new value; finally, the results of all those function applications are combined into a result object.
- Group keys can take several forms (a small sketch follows this list): a list or array of values the same length as the axis being grouped; a value indicating a column name in a DataFrame; a dict or Series giving a correspondence between the values on the axis being grouped and the group names; a function to be invoked on the axis index or the individual labels in the index.
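- A minimal sketch of the non-column-name key types on a toy Series (toy, its labels, and the group names are made up for illustration; column-name keys require a DataFrame and appear with df below):
import pandas as pd
toy = pd.Series([1, 2, 3], index=['ant', 'bee', 'cow'])
toy.groupby(['g1', 'g1', 'g2']).sum()                       # array of values, same length as the axis
toy.groupby({'ant': 'g1', 'bee': 'g1', 'cow': 'g2'}).sum()  # dict mapping index labels to group names
toy.groupby(lambda label: label[0]).sum()                   # function invoked once per index label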
import numpy as np
import pandas as pd
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
'key2': ['one', 'two', 'one', 'two', 'one'],
'data1': np.random.randn(5),
'data2': np.random.randn(5)})
grouped = df['data1'].groupby(df['key1'])
df:
  key1 key2     data1     data2
0    a  one -0.556877  0.300878
1    a  two  0.873811  0.742571
2    b  one  1.997530 -0.632550
3    b  two  0.966659 -1.091297
4    a  one -0.199236 -0.990511
grouped: grouped is a GroupBy object; it has not yet computed anything
grouped.mean()
key1
a 0.039233
b 1.482094
Name: data1, dtype: float64
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means: the result is a Series with a hierarchical index
key1 key2
a one -0.378057
two 0.873811
b one 1.997530
two 0.966659
Name: data1, dtype: float64
means.unstack()
key2       one       two
key1
a    -0.378057  0.873811
b     1.997530  0.966659
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()
California 2005 0.873811
2006 1.997530
Ohio 2005 0.204891
2006 -0.199236
Name: data1, dtype: float64
df.groupby('key1').mean()
         data1     data2
key1
a     0.039233  0.017646
b     1.482094 -0.861923
df.groupby(['key1', 'key2']).mean()
              data1     data2
key1 key2
a    one  -0.378057 -0.344816
     two   0.873811  0.742571
b    one   1.997530 -0.632550
     two   0.966659 -1.091297
1.1 Iterating over Groups
- The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data
for name, group in df.groupby('key1'):
print(name)
print(group)
a
key1 key2 data1 data2
0 a one -0.556877 0.300878
1 a two 0.873811 0.742571
4 a one -0.199236 -0.990511
b
key1 key2 data1 data2
2 b one 1.997530 -0.632550
3 b two 0.966659 -1.091297
- In the case of multiple group keys, the first element of the tuple is a tuple of the key values
for (k1, k2), group in df.groupby(['key1', 'key2']):
print((k1, k2))
print(group)
('a', 'one')
key1 key2 data1 data2
0 a one -0.556877 0.300878
4 a one -0.199236 -0.990511
('a', 'two')
key1 key2 data1 data2
1 a two 0.873811 0.742571
('b', 'one')
key1 key2 data1 data2
2 b one 1.99753 -0.63255
('b', 'two')
key1 key2 data1 data2
3 b two 0.966659 -1.091297
pieces = dict(list(df.groupby('key1')))
pieces
{'a': key1 key2 data1 data2
0 a one -0.556877 0.300878
1 a two 0.873811 0.742571
4 a one -0.199236 -0.990511, 'b': key1 key2 data1 data2
2 b one 1.997530 -0.632550
3 b two 0.966659 -1.091297}
pieces['b']
  key1 key2     data1     data2
2    b  one  1.997530 -0.632550
3    b  two  0.966659 -1.091297
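- A single chunk can also be fetched straight from the GroupBy object without building the dict first (get_group is a standard GroupBy method):
df.groupby('key1').get_group('b')   # same rows as pieces['b']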
- By default, groupby groups on axis=0; passing axis=1 groups the columns instead
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
print(dtype)
print(group)
float64
data1 data2
0 -0.556877 0.300878
1 0.873811 0.742571
2 1.997530 -0.632550
3 0.966659 -1.091297
4 -0.199236 -0.990511
object
key1 key2
0 a one
1 a two
2 b one
3 b two
4 a one
1.2 Selecting a Column or Subset of Columns
- Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation, so the following two calls produce the same result:
df.groupby(['key1', 'key2'])[['data2']].mean()
df[['data2']].groupby([df['key1'],df['key2']]).mean()
              data2
key1 key2
a    one  -0.344816
     two   0.742571
b    one  -0.632550
     two  -1.091297
- If a list or array is passed, the object returned by this indexing operation is a grouped DataFrame (above); if only a single column name is passed as a scalar, it is a grouped Series (below; the type check after this example makes the distinction explicit):
s_grouped = df.groupby(['key1', 'key2'])['data2']
s_grouped.mean()
key1 key2
a one -0.344816
two 0.742571
b one -0.632550
two -1.091297
Name: data2, dtype: float64
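- The distinction is visible directly in the returned object types (a quick check on the df above):
df.groupby('key1')['data1']     # SeriesGroupBy: scalar column name
df.groupby('key1')[['data1']]   # DataFrameGroupBy: list of column names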
1.3 Grouping with Dicts and Series
people = pd.DataFrame(np.random.randn(5, 5),
index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'],
columns=['a', 'b', 'c', 'd', 'e'])
people.iloc[2:3, [1, 2]] = np.nan
people
               a         b         c         d         e
Joe     0.635451 -0.146126 -0.403298 -1.305932  0.049308
Steve  -0.343671 -1.237591  0.765479 -0.123180  0.095394
Wes    -0.969406       NaN       NaN  0.083540 -0.497828
Jim     1.758974 -0.234628 -0.631201  1.326421  0.075890
Travis  0.413102  1.506319 -0.899817  0.630404 -1.457132
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
'd': 'blue', 'e': 'red', 'f': 'orange'}
by_column = people.groupby(mapping, axis=1)
by_column.sum()
            blue       red
Joe    -1.709229  0.538633
Steve   0.642299 -1.485868
Wes     0.083540 -1.467234
Jim     0.695220  1.600236
Travis -0.269413  0.462289
- Series have the same functionality and can be viewed as a fixed-size mapping:
map_series = pd.Series(mapping)
people.groupby(map_series, axis=1).count()
        blue  red
Joe        2    3
Steve      2    3
Wes        1    2
Jim        2    3
Travis     2    3
1.4 Grouping with Functions
- Any function passed as a group key will be called once per index value, with the return values used as the group names. For example, group by the length of the names in the DataFrame above:
people.groupby(len).sum()
          a         b         c         d         e
3  1.425019 -0.380754 -1.034499  0.104029 -0.372630
5 -0.343671 -1.237591  0.765479 -0.123180  0.095394
6  0.413102  1.506319 -0.899817  0.630404 -1.457132
- Mixing functions with arrays, dicts, or Series is not a problem, as everything gets converted to arrays internally (see the sketch after the table below):
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
              a         b         c         d         e
3 one -0.969406 -0.146126 -0.403298 -1.305932 -0.497828
  two  1.758974 -0.234628 -0.631201  1.326421  0.075890
5 one -0.343671 -1.237591  0.765479 -0.123180  0.095394
6 two  0.413102  1.506319 -0.899817  0.630404 -1.457132
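- Internally, the function is first applied to each index label, so the call above should be equivalent to passing the resulting array explicitly (a sketch; name_lengths is a made-up helper name):
name_lengths = np.array([len(name) for name in people.index])   # array([3, 5, 3, 3, 6])
people.groupby([name_lengths, key_list]).min()                  # same result as groupby([len, key_list])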
1.5 Grouping by Index Levels
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
[1, 3, 5, 1, 3]],
names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df.groupby(level='cty', axis=1).count()
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
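- Levels can also be referenced by number rather than name; level=0 corresponds to 'cty' here:
hier_df.groupby(level=0, axis=1).count()   # equivalent to level='cty'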
2. Data Aggregation
- Aggregations are data transformations that produce scalar values from arrays; common aggregations are listed in the table below:
| Function | Description |
| --- | --- |
| count | Number of non-NA values in the group |
| sum | Sum of non-NA values |
| mean | Mean of non-NA values |
| median | Arithmetic median of non-NA values |
| std, var | Unbiased (n-1 denominator) standard deviation and variance |
| min, max | Minimum and maximum of non-NA values |
| prod | Product of non-NA values |
| first, last | First and last non-NA values |
- Beyond these common aggregations you can use aggregations of your own devising. For example, quantile computes sample quantiles; while quantile is not explicitly implemented for GroupBy, it is a Series method and therefore available for aggregation. Internally, GroupBy efficiently slices the Series, calls piece.quantile(0.9) on each piece, and then assembles those results into the result object:
df:
  key1 key2     data1     data2
0    a  one -0.556877  0.300878
1    a  two  0.873811  0.742571
2    b  one  1.997530 -0.632550
3    b  two  0.966659 -1.091297
4    a  one -0.199236 -0.990511
grouped = df.groupby('key1')
grouped['data1'].quantile(0.9)
key1
a 0.659202
b 1.894443
Name: data1, dtype: float64
- To use your own aggregation functions, pass the function to the aggregate or agg method
def peak_to_peak(arr):
return arr.max() - arr.min()
grouped.agg(peak_to_peak)
         data1     data2
key1
a     1.430689  1.733082
b     1.030872  0.458747
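- Built-in aggregations can equally be requested by their string names, and one-off custom aggregations can be written inline as lambdas (a small sketch on the same grouped object):
grouped.agg('mean')                             # string alias for a built-in aggregation
grouped.agg(lambda arr: arr.max() - arr.min())  # same computation as peak_to_peak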
2.1 Column-Wise and Multiple Function Application
- Returning to the tips dataset from before, we add a tip percentage column and then aggregate each column with several functions at once:
tips = pd.read_csv(r'C:/Users/Raymone/Data Analysis/examples/tips.csv')
tips['tip_pct'] = tips['tip'] / tips['total_bill']
grouped = tips.groupby(['day', 'smoker'])
grouped_pct = grouped['tip_pct']
grouped_pct.agg(['mean', 'std', peak_to_peak])
                 mean       std  peak_to_peak
day  smoker
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240
- By default the function names are used as the column names in the result; passing a list of (name, function) tuples specifies the result's column names instead:
grouped_pct.agg([('foo', 'mean'), ('bar', np.std)])
                  foo       bar
day  smoker
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
- For a DataFrame, you can specify a list of functions to be applied to all of the columns as follows
functions = ['count', 'mean', 'max']
result = grouped[['tip_pct', 'total_bill']].agg(functions)
result
            tip_pct                     total_bill
              count      mean       max      count       mean    max
day  smoker
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11
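- Custom column names work here too: following the tuple convention above, a list of (name, function) tuples can be passed in place of plain functions (a sketch; ftuples is a made-up name):
ftuples = [('Average', 'mean'), ('Variance', np.var)]
grouped[['tip_pct', 'total_bill']].agg(ftuples)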
- To apply potentially different functions to one or more of the columns, pass a dict mapping column names to functions to agg
grouped.agg({'tip_pct': ['min', 'max', 'mean', 'std'], 'size': 'sum'})
              tip_pct                               size
                  min       max      mean       std  sum
day  smoker
Fri  No      0.120385  0.187735  0.151650  0.028123    9
     Yes     0.103555  0.263480  0.174783  0.051293   31
Sat  No      0.056797  0.291990  0.158048  0.039767  115
     Yes     0.035638  0.325733  0.147906  0.061375  104
Sun  No      0.059447  0.252672  0.160113  0.042347  167
     Yes     0.065660  0.710345  0.187250  0.154134   49
Thur No      0.072961  0.266312  0.160298  0.038774  112
     Yes     0.090014  0.241255  0.163863  0.039389   40
2.2 Returning Aggregated Data Without Row Indexes
- The as_index parameter controls whether the returned data carries the group keys as a row index (a reset_index sketch at the end of this subsection shows the relationship)
tips.groupby(['day', 'smoker'], as_index=False).mean()
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
tips.groupby(['day', 'smoker']).mean()
             total_bill       tip      size   tip_pct
day  smoker
Fri  No       18.420000  2.812500  2.250000  0.151650
     Yes      16.813333  2.714000  2.066667  0.174783
Sat  No       19.661778  3.102889  2.555556  0.158048
     Yes      21.276667  2.875476  2.476190  0.147906
Sun  No       20.506667  3.167895  2.929825  0.160113
     Yes      24.120000  3.516842  2.578947  0.187250
Thur No       17.113111  2.673778  2.488889  0.160298
     Yes      19.190588  3.030000  2.352941  0.163863
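- The as_index=False layout can also be recovered from the hierarchically indexed result by calling reset_index, which moves the group keys back into ordinary columns:
tips.groupby(['day', 'smoker']).mean().reset_index()   # same table as as_index=False above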
3. Apply: General Split-Apply-Combine
- The most general-purpose GroupBy method is apply. apply splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together. For example, select the rows with the five largest tip_pct values from the tips dataset:
def top(df, n=5, column='tip_pct'):
return df.sort_values(by=column)[-n:]
top(tips, n=5)
     total_bill   tip smoker  day    time  size   tip_pct
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
232       11.61  3.39     No  Sat  Dinner     2  0.291990
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345
- Grouping by smoker and calling apply with top invokes the top function on each group of the DataFrame, after which the results are glued together with pandas.concat, labeling the pieces with the group names.
tips.groupby('smoker').apply(top)
            total_bill   tip smoker   day    time  size   tip_pct
smoker
No     88        24.71  5.85     No  Thur   Lunch     2  0.236746
       185       20.69  5.00     No   Sun  Dinner     5  0.241663
       51        10.29  2.60     No   Sun  Dinner     2  0.252672
       149        7.51  2.00     No  Thur   Lunch     2  0.266312
       232       11.61  3.39     No   Sat  Dinner     2  0.291990
Yes    109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
       183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
       67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
       178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
       172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
                 total_bill    tip smoker   day    time  size   tip_pct
smoker day
No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.142857
       Sat  212       48.33   9.00     No   Sat  Dinner     4  0.186220
       Sun  156       48.17   5.00     No   Sun  Dinner     6  0.103799
       Thur 142       41.19   5.00     No  Thur   Lunch     5  0.121389
Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.117750
       Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.196812
       Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.077178
       Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.115982
3.1 Suppressing the Group Keys
- The group_keys parameter controls whether the group keys become part of the resulting index
tips.groupby('smoker', group_keys=False).apply(top)
     total_bill   tip smoker   day    time  size   tip_pct
88        24.71  5.85     No  Thur   Lunch     2  0.236746
185       20.69  5.00     No   Sun  Dinner     5  0.241663
51        10.29  2.60     No   Sun  Dinner     2  0.252672
149        7.51  2.00     No  Thur   Lunch     2  0.266312
232       11.61  3.39     No   Sat  Dinner     2  0.291990
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
3.2 Quantile and Bucket Analysis
- Using cut and qcut together with groupby makes it convenient to perform bucket or quantile analysis on a dataset. Consider a simple random dataset and an equal-length bucket categorization using cut:
frame = pd.DataFrame({'data1': np.random.randn(1000),
'data2': np.random.randn(1000)})
quartiles = pd.cut(frame.data1, 4)
def get_stats(group):
return {'min': group.min(), 'max': group.max(),
'count': group.count(), 'mean': group.mean()}
grouped = frame.data2.groupby(quartiles)
grouped.apply(get_stats)
data1
(-3.745, -1.961] count 27.000000
max 2.268562
mean -0.071299
min -1.510138
(-1.961, -0.185] count 419.000000
max 3.113571
mean -0.036838
min -2.655437
(-0.185, 1.592] count 491.000000
max 2.738569
mean -0.010014
min -2.702930
(1.592, 3.368] count 63.000000
max 1.585453
mean -0.131166
min -2.354267
Name: data2, dtype: float64
grouped.apply(get_stats).unstack()
                  count       max      mean       min
data1
(-3.745, -1.961]   27.0  2.268562 -0.071299 -1.510138
(-1.961, -0.185]  419.0  3.113571 -0.036838 -2.655437
(-0.185, 1.592]   491.0  2.738569 -0.010014 -2.702930
(1.592, 3.368]     63.0  1.585453 -0.131166 -2.354267
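- The same statistics can arguably be obtained without the custom function, since agg accepts a list of aggregation names (a sketch on the same grouped object):
grouped.agg(['min', 'max', 'count', 'mean'])   # same columns, already in tabular form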
- To compute equal-size buckets based on sample quantiles, use qcut (cut: bins by value range; qcut: bins by count, giving equally populated bins);
- Pass labels=False to obtain quantile numbers (which bin, rather than the bin's interval);
grouping = pd.qcut(frame.data1, 10, labels=False)
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
       count       max      mean       min
data1
0      100.0  2.508341  0.067957 -2.055184
1      100.0  2.045575  0.045618 -2.454061
2      100.0  3.113571 -0.092481 -2.655437
3      100.0  2.224693 -0.066634 -2.460681
4      100.0  2.361963 -0.245740 -2.305434
5      100.0  1.807228 -0.073544 -2.590441
6      100.0  2.738569  0.048381 -2.682698
7      100.0  2.443707  0.038001 -2.609242
8      100.0  2.696874  0.093630 -2.702930
9      100.0  1.828048 -0.120592 -2.492170
3.3 Example: Filling Missing Values with Group-Specific Values
- When cleaning up missing data, fillna replaces NA values; here they are filled with the mean:
s = pd.Series(np.random.randn(6))
s[::2] = np.nan
s.fillna(s.mean())
0 0.844620
1 0.985056
2 0.844620
3 0.075285
4 0.844620
5 1.473518
dtype: float64
- If the fill value needs to vary by group, one approach is to group the data and use apply, calling fillna on each data chunk
states = ['Ohio', 'New York', 'Vermont', 'Florida',
'Oregon', 'Nevada', 'California', 'Idaho']
group_key = ['East'] * 4 + ['West'] * 4
data = pd.Series(np.random.randn(8), index=states)
data[['Vermont', 'Nevada', 'Idaho']] = np.nan
data.groupby(group_key).mean()
fill_mean = lambda g: g.fillna(g.mean())
data.groupby(group_key).apply(fill_mean)
Ohio 1.875761
New York -0.467642
Vermont -0.243175
Florida -2.137644
Oregon -1.650985
Nevada -0.738496
California 0.173993
Idaho -0.738496
dtype: float64
- In another case, you may have predefined fill values for each group; since the groups have an internal name attribute set, you can use it:
fill_values = {'East': 0.5, 'West': -1}
fill_func = lambda g: g.fillna(fill_values[g.name])
data.groupby(group_key).apply(fill_func)
Ohio 1.875761
New York -0.467642
Vermont 0.500000
Florida -2.137644
Oregon -1.650985
Nevada -1.000000
California 0.173993
Idaho -1.000000
dtype: float64
3.4 Example: Random Sampling and Permutation
- To draw a random sample from a large dataset, use the Series sample method. As a demonstration, construct a deck of playing cards and draw 5 cards from it:
suits = ['H', 'S', 'C', 'D']
card_val = (list(range(1, 11)) + [10] * 3) * 4
base_names = ['A'] + list(range(2, 11)) + ['J', 'K', 'Q']
cards = []
for suit in suits:
cards.extend(str(num) + suit for num in base_names)
deck = pd.Series(card_val, index=cards)
def draw(deck, n=5):
return deck.sample(n)
draw(deck)
3D 3
2H 2
3H 3
9S 9
2C 2
dtype: int64
- To draw two random cards from each suit: since the suit is the last character of each card name, we can group on that:
get_suit = lambda card: card[-1]
deck.groupby(get_suit).apply(draw, n=2)
C AC 1
QC 10
D 10D 10
5D 5
H 2H 2
9H 9
S 9S 9
4S 4
dtype: int64
deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
AC 1
10C 10
8D 8
5D 5
4H 4
QH 10
8S 8
10S 10
dtype: int64
3.5 Example: Group Weighted Average and Correlation
- Operations between columns in a DataFrame, or between two Series, such as a group weighted average, are possible:
df = pd.DataFrame({'category': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
'data': np.random.randn(8),
'weights': np.random.rand(8)})
grouped = df.groupby('category')
get_wavg = lambda g: np.average(g['data'], weights=g['weights'])
grouped.apply(get_wavg)
category
a 0.689109
b 0.066072
dtype: float64
df
  category      data   weights
0        a  0.932987  0.480867
1        a -2.439921  0.291352
2        a -0.593844  0.021750
3        a  2.441942  0.469114
4        b -0.540927  0.473682
5        b  0.304671  0.353316
6        b -0.479385  0.855175
7        b  1.361458  0.516976
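- Hand-checking one group makes the computation concrete (a sketch; sub is a made-up name):
sub = df[df['category'] == 'a']
np.average(sub['data'], weights=sub['weights'])   # matches grouped.apply(get_wavg)['a']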
- As another example, consider a dataset containing closing prices for the S&P 500 index (SPX) and a few stocks; compute a DataFrame consisting of the yearly correlations of daily returns with SPX
close_px = pd.read_csv(r'C:/Users/Raymone/Data Analysis/examples/stock_px_2.csv', parse_dates=True, index_col=0)
spx_corr = lambda x: x.corrwith(x['SPX'])
rets = close_px.pct_change().dropna()
get_year = lambda x: x.year
by_year = rets.groupby(get_year)
by_year.apply(spx_corr)
          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0
close_px[-4:]
              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58
by_year.apply(lambda g: g['AAPL'].corr(g['MSFT']))
2003 0.480868
2004 0.259024
2005 0.300093
2006 0.161735
2007 0.417738
2008 0.611901
2009 0.432738
2010 0.571946
2011 0.581987
dtype: float64
3.6 Example: Group-Wise Linear Regression
- You can use groupby to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value
import statsmodels.api as sm
def regress(data, yvar, xvars):
    # Ordinary least squares regression of yvar on xvars, returning the fitted parameters
    Y = data[yvar]
    X = data[xvars]
    X['intercept'] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
by_year.apply(regress, 'AAPL', ['SPX'])
           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514
4. Pivot Tables and Cross-Tabulation
4.1 Pivot Tables
- Pivot tables in pandas are made possible by the groupby facility combined with reshape operations utilizing hierarchical indexing. DataFrame has a pivot_table method, and there is also a top-level pandas.pivot_table function. In addition to providing a convenience interface to groupby, pivot_table can add partial totals, also known as margins.
- Returning to the tips dataset, compute a table of group means (the default pivot_table aggregation type) arranged by day and smoker on the rows
tips.pivot_table(index=['day', 'smoker'])
                 size       tip   tip_pct  total_bill
day  smoker
Fri  No      2.250000  2.812500  0.151650   18.420000
     Yes     2.066667  2.714000  0.174783   16.813333
Sat  No      2.555556  3.102889  0.158048   19.661778
     Yes     2.476190  2.875476  0.147906   21.276667
Sun  No      2.929825  3.167895  0.160113   20.506667
     Yes     2.578947  3.516842  0.187250   24.120000
Thur No      2.488889  2.673778  0.160298   17.113111
     Yes     2.352941  3.030000  0.163863   19.190588
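- Since pivot_table is built on groupby plus reshaping, the default table above can also be produced directly (a sketch):
tips.groupby(['day', 'smoker']).mean()   # same numbers as pivot_table(index=['day', 'smoker'])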
- Suppose we want to aggregate only tip_pct and size, and additionally group by time:
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')
                 size             tip_pct
smoker             No       Yes        No       Yes
time   day
Dinner Fri   2.000000  2.222222  0.139622  0.165347
       Sat   2.555556  2.476190  0.158048  0.147906
       Sun   2.929825  2.578947  0.160113  0.187250
       Thur  2.000000       NaN  0.159744       NaN
Lunch  Fri   3.000000  1.833333  0.187735  0.188937
       Thur  2.500000  2.352941  0.160311  0.163863
- Passing margins=True includes partial totals
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True)
                 size                       tip_pct
smoker             No       Yes       All        No       Yes       All
time   day
Dinner Fri   2.000000  2.222222  2.166667  0.139622  0.165347  0.158916
       Sat   2.555556  2.476190  2.517241  0.158048  0.147906  0.153152
       Sun   2.929825  2.578947  2.842105  0.160113  0.187250  0.166897
       Thur  2.000000       NaN  2.000000  0.159744       NaN  0.159744
Lunch  Fri   3.000000  1.833333  2.000000  0.187735  0.188937  0.188765
       Thur  2.500000  2.352941  2.459016  0.160311  0.163863  0.161301
All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803
- To use an aggregation function other than mean, pass it to the aggfunc keyword
tips.pivot_table('tip_pct', index=['time', 'smoker'], columns='day', aggfunc=len, margins=True)
day             Fri   Sat   Sun  Thur    All
time   smoker
Dinner No       3.0  45.0  57.0   1.0  106.0
       Yes      9.0  42.0  19.0   NaN   70.0
Lunch  No       1.0   NaN   NaN  44.0   45.0
       Yes      6.0   NaN   NaN  17.0   23.0
All            19.0  87.0  76.0  62.0  244.0
tips.pivot_table('tip_pct', index=['time', 'size', 'smoker'],
columns='day', aggfunc='mean', fill_value=0)
day                      Fri       Sat       Sun      Thur
time   size smoker
Dinner 1    No      0.000000  0.137931  0.000000  0.000000
            Yes     0.000000  0.325733  0.000000  0.000000
       2    No      0.139622  0.162705  0.168859  0.159744
            Yes     0.171297  0.148668  0.207893  0.000000
       3    No      0.000000  0.154661  0.152663  0.000000
            Yes     0.000000  0.144995  0.152660  0.000000
       4    No      0.000000  0.150096  0.148143  0.000000
            Yes     0.117750  0.124515  0.193370  0.000000
       5    No      0.000000  0.000000  0.206928  0.000000
            Yes     0.000000  0.106572  0.065660  0.000000
       6    No      0.000000  0.000000  0.103799  0.000000
Lunch  1    No      0.000000  0.000000  0.000000  0.181728
            Yes     0.223776  0.000000  0.000000  0.000000
       2    No      0.000000  0.000000  0.000000  0.166005
            Yes     0.181969  0.000000  0.000000  0.158843
       3    No      0.187735  0.000000  0.000000  0.084246
            Yes     0.000000  0.000000  0.000000  0.204952
       4    No      0.000000  0.000000  0.000000  0.138919
            Yes     0.000000  0.000000  0.000000  0.155410
       5    No      0.000000  0.000000  0.000000  0.121389
       6    No      0.000000  0.000000  0.000000  0.173706
pivot_table options:
| Option | Description |
| --- | --- |
| values | Column name(s) to aggregate; by default aggregates all numeric columns |
| index | Column names or other group keys to group on the rows of the resulting pivot table |
| columns | Column names or other group keys to group on the columns of the resulting pivot table |
| aggfunc | Aggregation function or list of functions ('mean' by default); can be any function valid in a groupby context |
| fill_value | Value to replace missing values with in the result table |
| dropna | If True, do not include columns whose entries are all NA |
| margins | Add row/column subtotals and a grand total (False by default) |
4.2 Cross-Tabulations: crosstab
- A cross-tabulation is a special case of a pivot table that computes group frequencies (counts):
data = pd.DataFrame({'Sample': np.arange(10),
'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan', 'Japan', 'USA', 'USA', 'Japan', 'USA'],
'Handedness': ['Right-handed', 'Left_handed', 'Right-handed', 'Right-handed', 'Left_handed', 'Right-handed',
'Right-handed', 'Left_handed', 'Right-handed', 'Right-handed']})
pd.crosstab(data.Nationality, data.Handedness, margins=True)
Handedness   Left_handed  Right-handed  All
Nationality
Japan                  2             3    5
USA                    1             4    5
All                    3             7   10
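- crosstab here is shorthand for a counting pivot table; the same counts can be produced with pivot_table (a sketch that uses the Sample column purely as something to count):
data.pivot_table('Sample', index='Nationality', columns='Handedness',
                 aggfunc='count', margins=True)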
- The first two arguments to crosstab can each be an array, Series, or list of arrays; for example, with the tips data:
pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)
smoker        No  Yes  All
time   day
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244