小结:
pandas库的参数变了不少,一定要先去官网升级到最新版的pandas,然后下载一个详细RN用于查询。
- 透视表pivot_table
- 交叉表 crosstab
1. 透视表pivot_table
根据一个键或多个键做数据聚合,默认类型是:根据键值/键值对,计算分组平均数
常规引入相关库
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
注意pivot_table 可以在DadaFrame或pandas中调用,以下是在DadaFrame中调用。
# 读取数据并确定是DataFrame格式
In [10]: tips = pd.read_csv('ch08/tips.csv')
In [11]: tips['tip_pct'] = tips['tip'] / tips['total_bill']
In [12]: tips[:10]
Out[12]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
5 25.29 4.71 Male No Sun Dinner 4 0.186240
6 8.77 2.00 Male No Sun Dinner 2 0.228050
7 26.88 3.12 Male No Sun Dinner 4 0.116071
8 15.04 1.96 Male No Sun Dinner 2 0.130319
9 14.78 3.23 Male No Sun Dinner 2 0.218539
In [5]: type(tips)
Out[5]: pandas.core.frame.DataFrame
# 普通透视表,设定index
In [6]: tips.pivot_table(index=['sex','smoker'])
Out[6]:
size tip tip_pct total_bill
sex smoker
Female No 2.592593 2.773519 0.156921 18.105185
Yes 2.242424 2.931515 0.182150 17.977879
Male No 2.711340 3.113402 0.160669 19.791237
Yes 2.500000 3.051167 0.152771 22.284500
# 多层透视表, margins 作用是多加一个ALL列,不考虑分组级别中的差异
In [9]: tips.pivot_table( ['tip_pct','size'], index = ['sex','day'],
...: columns ='smoker')
...:
...:
Out[9]:
size tip_pct
smoker No Yes No Yes
sex day
Female Fri 2.500000 2.000000 0.165296 0.209129
Sat 2.307692 2.200000 0.147993 0.163817
Sun 3.071429 2.500000 0.165710 0.237075
Thur 2.480000 2.428571 0.155971 0.163073
Male Fri 2.000000 2.125000 0.138005 0.144730
Sat 2.656250 2.629630 0.162132 0.139067
Sun 2.883721 2.600000 0.158291 0.173964
Thur 2.500000 2.300000 0.165706 0.164417
In [10]:
In [10]: tips.pivot_table( ['tip_pct','size'], index = ['sex','day'],
...: columns ='smoker', margins = True)
...:
...:
...:
Out[10]:
size tip_pct
smoker No Yes All No Yes All
sex day
Female Fri 2.500000 2.000000 2.111111 0.165296 0.209129 0.199388
Sat 2.307692 2.200000 2.250000 0.147993 0.163817 0.156470
Sun 3.071429 2.500000 2.944444 0.165710 0.237075 0.181569
Thur 2.480000 2.428571 2.468750 0.155971 0.163073 0.157525
Male Fri 2.000000 2.125000 2.100000 0.138005 0.144730 0.143385
Sat 2.656250 2.629630 2.644068 0.162132 0.139067 0.151577
Sun 2.883721 2.600000 2.810345 0.158291 0.173964 0.162344
Thur 2.500000 2.300000 2.433333 0.165706 0.164417 0.165276
All 2.668874 2.408602 2.569672 0.159328 0.163196 0.160803
# 一个对比例子, 普通透视表,margins 参数, fill_value填充空值,aggfunc使用其他函数(还能用lambda或自定义函数)
In [11]: tips.pivot_table('tip_pct',index=['sex','smoker'],columns='day')
Out[11]:
day Fri Sat Sun Thur
sex smoker
Female No 0.165296 0.147993 0.165710 0.155971
Yes 0.209129 0.163817 0.237075 0.163073
Male No 0.138005 0.162132 0.158291 0.165706
Yes 0.144730 0.139067 0.173964 0.164417
In [12]:
In [12]: tips.pivot_table('tip_pct',index=['sex','smoker'],columns='day', margins= True)
Out[12]:
day Fri Sat Sun Thur All
sex smoker
Female No 0.165296 0.147993 0.165710 0.155971 0.156921
Yes 0.209129 0.163817 0.237075 0.163073 0.182150
Male No 0.138005 0.162132 0.158291 0.165706 0.160669
Yes 0.144730 0.139067 0.173964 0.164417 0.152771
All 0.169913 0.153152 0.166897 0.161276 0.160803
In [13]:
In [13]: tips.pivot_table('tip_pct',index=['sex','smoker'],columns='day', margins= True, aggfunc=len)
Out[13]:
day Fri Sat Sun Thur All
sex smoker
Female No 2.0 13.0 14.0 25.0 54.0
Yes 7.0 15.0 4.0 7.0 33.0
Male No 2.0 32.0 43.0 20.0 97.0
Yes 8.0 27.0 15.0 10.0 60.0
All 19.0 87.0 76.0 62.0 244.0
In [14]:
In [14]:
In [14]: tips.pivot_table('size',index=['time','sex','smoker'],columns='day',aggfunc='sum')
Out[14]:
day Fri Sat Sun Thur
time sex smoker
Dinner Female No 2.0 30.0 43.0 2.0
Yes 8.0 33.0 10.0 NaN
Male No 4.0 85.0 124.0 NaN
Yes 12.0 71.0 39.0 NaN
Lunch Female No 3.0 NaN NaN 60.0
Yes 6.0 NaN NaN 17.0
Male No NaN NaN NaN 50.0
Yes 5.0 NaN NaN 23.0
In [15]: tips.pivot_table('size',index=['time','sex','smoker'],columns='day',aggfunc='sum', fill_value=0
...: )
Out[15]:
day Fri Sat Sun Thur
time sex smoker
Dinner Female No 2 30 43 2
Yes 8 33 10 0
Male No 4 85 124 0
Yes 12 71 39 0
Lunch Female No 3 0 0 60
Yes 6 0 0 17
Male No 0 0 0 50
Yes 5 0 0 23
In [16]:
2. 交叉表 crosstab
计算分组频率的特殊透视表。
In [20]: data = pd.DataFrame({'Sample': range(1, 11), 'Gender': ['Female', 'Male', 'Female', 'Male', 'Male', 'Male',
...: 'Female', 'Female', 'Male', 'Female'],
...: 'Handedness': ['Right-handed', 'Left-handed', 'Right-handed', 'Right-handed', 'Left-han
...: ded', 'Right-handed', 'Right-handed', 'Left-handed', 'Right-handed', 'Right-handed']})
...:
In [21]: data
Out[21]:
Sample Gender Handedness
0 1 Female Right-handed
1 2 Male Left-handed
2 3 Female Right-handed
3 4 Male Right-handed
4 5 Male Left-handed
5 6 Male Right-handed
6 7 Female Right-handed
7 8 Female Left-handed
8 9 Male Right-handed
9 10 Female Right-handed
In [22]: pd.crosstab(data.Gender,data.Handedness)
Out[22]:
Handedness Left-handed Right-handed
Gender
Female 1 4
Male 2 3
In [23]: pd.crosstab(data.Gender,data.Handedness, margins=True)
Out[23]:
Handedness Left-handed Right-handed All
Gender
Female 1 4 5
Male 2 3 5
All 3 7 10
In [24]:
2018.8.25