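A pivot table is an operation similar to GroupBy that is commonly seen in Excel. It takes column-wise data as input and outputs a two-dimensional table that summarizes the data by splitting it along multiple dimensions.
The examples use the Titanic passenger dataset, which is available through the Seaborn library: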
In [1]: import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
In [2]: titanic.head()
Out[2]:
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
In [3]: titanic.shape
Out[3]: (891, 15)
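This dataset contains a wealth of information about each passenger on that ill-fated voyage, including sex (sex), age (age), cabin class (class), and ticket fare (fare).
Grouping by sex lets us examine the relationship between sex and survival: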
In [4]: titanic.groupby("sex")["survived"].mean()
Out[4]:
sex
female 0.742038
male 0.188908
Name: survived, dtype: float64
The data shows that about 74% of the women survived, while only about 19% of the men did.
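The mean of a 0/1 column is the fraction of ones, which is why `mean()` on `survived` reads directly as a survival rate. A minimal sketch on a hypothetical toy frame (not the Titanic data) makes the arithmetic explicit:

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = did not survive
toy = pd.DataFrame({
    "sex": ["female"] * 4 + ["male"] * 4,
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

grouped = toy.groupby("sex")["survived"]

# Mean of a binary column per group
rate = grouped.mean()

# The same quantity by hand: number of survivors divided by group size
by_hand = grouped.sum() / grouped.count()

print(rate)     # female 0.75, male 0.25
print(by_hand)  # identical values
```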
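To go a step further, we can look at survival by both sex and cabin class. Following the GroupBy workflow, we can get the result we want: group by class and sex, select the survival column, apply the mean aggregate, combine the resulting groups, and finally unstack the innermost row index level into column labels, producing a two-dimensional table.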
In [5]: titanic.groupby(["sex","class"])["survived"].mean()
Out[5]:
sex class
female First 0.968085
Second 0.921053
Third 0.500000
male First 0.368852
Second 0.157407
Third 0.135447
Name: survived, dtype: float64
In [6]: titanic.groupby(["sex","class"])["survived"].mean().unstack()
Out[6]:
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
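The two steps above (a grouped mean with a two-level row index, then `unstack()`) can be inspected separately. The sketch below uses a small hypothetical frame with plain string columns, so the numbers are illustrative only:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "class":    ["First", "Third", "Third", "First", "First", "Third", "Third"],
    "survived": [1, 1, 0, 1, 0, 0, 0],
})

# Step 1: split by (sex, class), select one column, apply the mean.
# The result is a Series whose row index has two levels.
stacked = toy.groupby(["sex", "class"])["survived"].mean()

# Step 2: move the innermost index level ("class") to the columns.
table = stacked.unstack()

print(stacked.index.nlevels)  # 2
print(table)
```

After `unstack()`, the rows are labeled by `sex` and the columns by `class`, which is the same layout a pivot table produces.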
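Compared with the pivot_table method built into pandas, however, this statement is rather verbose, so we will use pivot_table to build pivot tables from here on.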
The signature of DataFrame's pivot_table method is shown below (recent pandas releases add further parameters, such as observed and sort):
DataFrame.pivot_table(values=None, index=None, columns=None,
aggfunc='mean', fill_value=None, margins=False,
dropna=True, margins_name='All')
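As a quick check that `pivot_table` expresses the same computation as the longer GroupBy chain, the sketch below compares the two on a hypothetical toy frame. It also calls the top-level `pd.pivot_table` function, which takes the frame as its first argument, whereas the method form does not.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "class":    ["First", "Third", "Third", "First", "First", "Third", "Third"],
    "survived": [1, 1, 0, 1, 0, 0, 0],
})

# GroupBy chain: split, apply, combine, then unstack
via_groupby = toy.groupby(["sex", "class"])["survived"].mean().unstack()

# Method form: the frame is the object the method is called on
via_method = toy.pivot_table(values="survived", index="sex",
                             columns="class", aggfunc="mean")

# Function form: the frame is passed explicitly as the first argument
via_function = pd.pivot_table(toy, values="survived", index="sex",
                              columns="class", aggfunc="mean")

print(via_method)
```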
Let's try out the individual parameters:
In [7]: titanic.pivot_table(index='sex', columns='class')
Out[7]:
adult_male age ····
class First Second Third First Second Third ····
sex
female 0.00000 0.000000 0.000000 34.611765 28.722973 21.750000 ····
male 0.97541 0.916667 0.919308 41.281386 30.740707 26.507589 ····
By default every column is aggregated. Passing the values parameter restricts the computation to the columns we actually want:
In [8]: age = pd.cut(titanic["age"], [0, 18, 80]) # bin the age column so the result is easier to read
In [9]: titanic.pivot_table(index=['sex', age], columns='class', values=['survived', 'fare'])
Out[9]:
fare survived
class First Second Third First Second Third
sex age
female (0, 18] 127.474245 25.064286 17.370835 0.909091 1.000000 0.511628
(18, 80] 105.043469 21.224653 14.785453 0.972973 0.900000 0.423729
male (0, 18] 114.638320 26.116947 20.639055 0.800000 0.600000 0.215686
(18, 80] 68.877389 20.219593 10.022624 0.375000 0.071429 0.133663
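The row labels `(0, 18]` and `(18, 80]` come from `pd.cut`, which by default builds intervals that are open on the left and closed on the right. A minimal sketch on a few hypothetical ages shows where boundary and missing values land:

```python
import pandas as pd

# Hypothetical ages, including boundary values and one missing entry
ages = pd.Series([5, 18, 19, 40, 80, None], name="age")

# Same bin edges as in the text: (0, 18] and (18, 80]
age_group = pd.cut(ages, [0, 18, 80])

print(age_group)
# 18 falls in (0, 18] because the right edge is included;
# 19 falls in (18, 80]; the missing age stays NaN.
```

Because the left edge is excluded, a value equal to the lowest bin edge (0 here) would also become NaN.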
In practice we do not always want the mean. In that case, the aggfunc parameter specifies which aggregation function to use:
In [10]: titanic.pivot_table(index='sex', columns='class', aggfunc={'survived': 'sum', 'fare': 'mean'})
Out[10]:
fare survived
class First Second Third First Second Third
sex
female 106.125798 21.970121 16.118810 91 70 72
male 67.226127 19.741782 12.661633 45 17 47
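Note that the values parameter is omitted here. When aggfunc is given as a mapping, the columns to be pivoted are already determined by the keys of that mapping.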
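To illustrate that the keys of the mapping select the columns, the sketch below pivots a hypothetical toy frame that also has an `age` column. Since `age` is not a key in `aggfunc`, it should not appear in the result.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "class":    ["First", "Third", "Third", "First", "First", "Third", "Third"],
    "survived": [1, 1, 0, 1, 0, 0, 0],
    "fare":     [80.0, 10.0, 8.0, 60.0, 40.0, 7.0, 9.0],
    "age":      [30.0, 22.0, 41.0, 50.0, 35.0, 19.0, 28.0],
})

# No `values` argument: the dict keys decide which columns are aggregated,
# and each column gets its own aggregation function.
table = toy.pivot_table(index="sex", columns="class",
                        aggfunc={"survived": "sum", "fare": "mean"})

# Top level of the column index lists the aggregated columns
print(sorted(set(table.columns.get_level_values(0))))  # ['fare', 'survived']
print(table)
```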
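To add a summary row and column covering each grouping, set the margins parameter. With the default aggfunc these margins are means over all rows in the group, not sums: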
In [11]: titanic.pivot_table('survived', index='sex', columns='class', margins=True)
Out[11]:
class First Second Third All
sex
female 0.968085 0.921053 0.500000 0.742038
male 0.368852 0.157407 0.135447 0.188908
All 0.629630 0.472826 0.242363 0.383838
The label of the margin can be customized with the margins_name parameter, which defaults to "All".
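A short sketch of `margins_name` on a hypothetical toy frame follows; the label "Overall" is an arbitrary choice for illustration. It also shows that each margin is computed from the underlying rows rather than by averaging the cell values.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset
toy = pd.DataFrame({
    "sex":      ["female", "female", "female", "male", "male", "male", "male"],
    "class":    ["First", "Third", "Third", "First", "First", "Third", "Third"],
    "survived": [1, 1, 0, 1, 0, 0, 0],
})

table = toy.pivot_table("survived", index="sex", columns="class",
                        margins=True, margins_name="Overall")
print(table)

# The female cells are 1.0 (First) and 0.5 (Third), whose plain average is 0.75,
# but the margin is the mean over all three female rows: 2/3.
print(table.loc["female", "Overall"])
```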