Pandas Pivot Tables and Crosstabs

Pivot tables and crosstabs


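Common operations for computing data analysis metrics, from the highest level to the lowest:

crosstab -> pivot table -> group aggregation -> custom function

  • A crosstab is a pivot table whose aggregation function is len (a count)
  • A pivot table is a grouped result, aggregated with mean by default, with one grouping key rotated onto the columns
  • Group aggregation is a special case of applying a custom function

The lower the level, the harder the code is to write and the wider its range of application. The higher the level, the simpler the code and the narrower its range of application.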
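The hierarchy above can be checked on a small table. The following is a minimal sketch on hypothetical toy data (not the tips dataset used later in this post): it builds the same count table at three levels, with crosstab, with pivot_table and aggfunc=len, and with groupby, size and unstack.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'smoker': ['No', 'Yes', 'No', 'No', 'Yes'],
    'tip': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Highest level: crosstab counts the rows for each (time, smoker) pair.
by_crosstab = pd.crosstab(df['time'], df['smoker'])

# One level down: a pivot table whose aggregation function is len.
by_pivot = df.pivot_table('tip', index='time', columns='smoker', aggfunc=len)

# Lowest level shown here: group, count, then rotate smoker onto the columns.
by_groupby = df.groupby(['time', 'smoker']).size().unstack()

print(by_crosstab)
print(by_pivot)
print(by_groupby)
```

All three tables hold the same counts. Only the amount of code, and how much of it can be customized, differs.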


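A pivot table is an advanced data summary table found in spreadsheet programs and data analysis software.

It aggregates data by one or more keys and arranges the results in a rectangular grid, with some group keys along the rows and others along the columns.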


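Being able to work with pivot tables is a basic benchmark for whether someone can take on a data analysis project.

  • Beginner: build a pivot table with pandas' built-in pivot_table method
  • Advanced: build one by hand by combining groupby and unstack

Typical structure of the crosstab function

# Common arguments: row index, column index, subtotals (margins)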
pd.crosstab(tips.time, [tips.smoker, tips.day], margins=True)

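Typical structure of the pivot_table function

# Common arguments: columns to aggregate, row index, column index, subtotals (margins, default False), aggregation function (aggfunc, default mean), fill value for missing data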
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True, aggfunc=len, fill_value=0)

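# For a crosstab, just set aggfunc to len

Low level: implementing a pivot table with group aggregation and axis rotation

# Group by the row and column keys; aggregate with mean; rotate the last key onto the columns; fill missing values with 0
# (multiple columns must be selected with a list, i.e. double brackets)
tips.groupby(['time', 'day', 'smoker'])[['size', 'tip_pct']].mean().unstack().fillna(0)

# For a crosstab, just replace the .mean() aggregation with .size()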

In [1]:

import numpy as np
import pandas as pd

File tips.csv

total_bill,tip,smoker,day,time,size
16.99,1.01,No,Sun,Dinner,2
10.34,1.66,No,Sun,Dinner,3
21.01,3.5,No,Sun,Dinner,3
23.68,3.31,No,Sun,Dinner,2
24.59,3.61,No,Sun,Dinner,4
25.29,4.71,No,Sun,Dinner,4
8.77,2.0,No,Sun,Dinner,2
26.88,3.12,No,Sun,Dinner,4
15.04,1.96,No,Sun,Dinner,2
14.78,3.23,No,Sun,Dinner,2
10.27,1.71,No,Sun,Dinner,2
35.26,5.0,No,Sun,Dinner,4
15.42,1.57,No,Sun,Dinner,2
18.43,3.0,No,Sun,Dinner,4
14.83,3.02,No,Sun,Dinner,2
21.58,3.92,No,Sun,Dinner,2
10.33,1.67,No,Sun,Dinner,3
16.29,3.71,No,Sun,Dinner,3
16.97,3.5,No,Sun,Dinner,3
20.65,3.35,No,Sat,Dinner,3
17.92,4.08,No,Sat,Dinner,2
20.29,2.75,No,Sat,Dinner,2
15.77,2.23,No,Sat,Dinner,2
39.42,7.58,No,Sat,Dinner,4
19.82,3.18,No,Sat,Dinner,2
17.81,2.34,No,Sat,Dinner,4
13.37,2.0,No,Sat,Dinner,2
12.69,2.0,No,Sat,Dinner,2
21.7,4.3,No,Sat,Dinner,2
19.65,3.0,No,Sat,Dinner,2
9.55,1.45,No,Sat,Dinner,2
18.35,2.5,No,Sat,Dinner,4
15.06,3.0,No,Sat,Dinner,2
20.69,2.45,No,Sat,Dinner,4
17.78,3.27,No,Sat,Dinner,2
24.06,3.6,No,Sat,Dinner,3
16.31,2.0,No,Sat,Dinner,3
16.93,3.07,No,Sat,Dinner,3
18.69,2.31,No,Sat,Dinner,3
31.27,5.0,No,Sat,Dinner,3
16.04,2.24,No,Sat,Dinner,3
17.46,2.54,No,Sun,Dinner,2
13.94,3.06,No,Sun,Dinner,2
9.68,1.32,No,Sun,Dinner,2
30.4,5.6,No,Sun,Dinner,4
18.29,3.0,No,Sun,Dinner,2
22.23,5.0,No,Sun,Dinner,2
32.4,6.0,No,Sun,Dinner,4
28.55,2.05,No,Sun,Dinner,3
18.04,3.0,No,Sun,Dinner,2
12.54,2.5,No,Sun,Dinner,2
10.29,2.6,No,Sun,Dinner,2
34.81,5.2,No,Sun,Dinner,4
9.94,1.56,No,Sun,Dinner,2
25.56,4.34,No,Sun,Dinner,4
19.49,3.51,No,Sun,Dinner,2
38.01,3.0,Yes,Sat,Dinner,4
26.41,1.5,No,Sat,Dinner,2
11.24,1.76,Yes,Sat,Dinner,2
48.27,6.73,No,Sat,Dinner,4
20.29,3.21,Yes,Sat,Dinner,2
13.81,2.0,Yes,Sat,Dinner,2
11.02,1.98,Yes,Sat,Dinner,2
18.29,3.76,Yes,Sat,Dinner,4
17.59,2.64,No,Sat,Dinner,3
20.08,3.15,No,Sat,Dinner,3
16.45,2.47,No,Sat,Dinner,2
3.07,1.0,Yes,Sat,Dinner,1
20.23,2.01,No,Sat,Dinner,2
15.01,2.09,Yes,Sat,Dinner,2
12.02,1.97,No,Sat,Dinner,2
17.07,3.0,No,Sat,Dinner,3
26.86,3.14,Yes,Sat,Dinner,2
25.28,5.0,Yes,Sat,Dinner,2
14.73,2.2,No,Sat,Dinner,2
10.51,1.25,No,Sat,Dinner,2
17.92,3.08,Yes,Sat,Dinner,2
27.2,4.0,No,Thur,Lunch,4
22.76,3.0,No,Thur,Lunch,2
17.29,2.71,No,Thur,Lunch,2
19.44,3.0,Yes,Thur,Lunch,2
16.66,3.4,No,Thur,Lunch,2
10.07,1.83,No,Thur,Lunch,1
32.68,5.0,Yes,Thur,Lunch,2
15.98,2.03,No,Thur,Lunch,2
34.83,5.17,No,Thur,Lunch,4
13.03,2.0,No,Thur,Lunch,2
18.28,4.0,No,Thur,Lunch,2
24.71,5.85,No,Thur,Lunch,2
21.16,3.0,No,Thur,Lunch,2
28.97,3.0,Yes,Fri,Dinner,2
22.49,3.5,No,Fri,Dinner,2
5.75,1.0,Yes,Fri,Dinner,2
16.32,4.3,Yes,Fri,Dinner,2
22.75,3.25,No,Fri,Dinner,2
40.17,4.73,Yes,Fri,Dinner,4
27.28,4.0,Yes,Fri,Dinner,2
12.03,1.5,Yes,Fri,Dinner,2
21.01,3.0,Yes,Fri,Dinner,2
12.46,1.5,No,Fri,Dinner,2
11.35,2.5,Yes,Fri,Dinner,2
15.38,3.0,Yes,Fri,Dinner,2
44.3,2.5,Yes,Sat,Dinner,3
22.42,3.48,Yes,Sat,Dinner,2
20.92,4.08,No,Sat,Dinner,2
15.36,1.64,Yes,Sat,Dinner,2
20.49,4.06,Yes,Sat,Dinner,2
25.21,4.29,Yes,Sat,Dinner,2
18.24,3.76,No,Sat,Dinner,2
14.31,4.0,Yes,Sat,Dinner,2
14.0,3.0,No,Sat,Dinner,2
7.25,1.0,No,Sat,Dinner,1
38.07,4.0,No,Sun,Dinner,3
23.95,2.55,No,Sun,Dinner,2
25.71,4.0,No,Sun,Dinner,3
17.31,3.5,No,Sun,Dinner,2
29.93,5.07,No,Sun,Dinner,4
10.65,1.5,No,Thur,Lunch,2
12.43,1.8,No,Thur,Lunch,2
24.08,2.92,No,Thur,Lunch,4
11.69,2.31,No,Thur,Lunch,2
13.42,1.68,No,Thur,Lunch,2
14.26,2.5,No,Thur,Lunch,2
15.95,2.0,No,Thur,Lunch,2
12.48,2.52,No,Thur,Lunch,2
29.8,4.2,No,Thur,Lunch,6
8.52,1.48,No,Thur,Lunch,2
14.52,2.0,No,Thur,Lunch,2
11.38,2.0,No,Thur,Lunch,2
22.82,2.18,No,Thur,Lunch,3
19.08,1.5,No,Thur,Lunch,2
20.27,2.83,No,Thur,Lunch,2
11.17,1.5,No,Thur,Lunch,2
12.26,2.0,No,Thur,Lunch,2
18.26,3.25,No,Thur,Lunch,2
8.51,1.25,No,Thur,Lunch,2
10.33,2.0,No,Thur,Lunch,2
14.15,2.0,No,Thur,Lunch,2
16.0,2.0,Yes,Thur,Lunch,2
13.16,2.75,No,Thur,Lunch,2
17.47,3.5,No,Thur,Lunch,2
34.3,6.7,No,Thur,Lunch,6
41.19,5.0,No,Thur,Lunch,5
27.05,5.0,No,Thur,Lunch,6
16.43,2.3,No,Thur,Lunch,2
8.35,1.5,No,Thur,Lunch,2
18.64,1.36,No,Thur,Lunch,3
11.87,1.63,No,Thur,Lunch,2
9.78,1.73,No,Thur,Lunch,2
7.51,2.0,No,Thur,Lunch,2
14.07,2.5,No,Sun,Dinner,2
13.13,2.0,No,Sun,Dinner,2
17.26,2.74,No,Sun,Dinner,3
24.55,2.0,No,Sun,Dinner,4
19.77,2.0,No,Sun,Dinner,4
29.85,5.14,No,Sun,Dinner,5
48.17,5.0,No,Sun,Dinner,6
25.0,3.75,No,Sun,Dinner,4
13.39,2.61,No,Sun,Dinner,2
16.49,2.0,No,Sun,Dinner,4
21.5,3.5,No,Sun,Dinner,4
12.66,2.5,No,Sun,Dinner,2
16.21,2.0,No,Sun,Dinner,3
13.81,2.0,No,Sun,Dinner,2
17.51,3.0,Yes,Sun,Dinner,2
24.52,3.48,No,Sun,Dinner,3
20.76,2.24,No,Sun,Dinner,2
31.71,4.5,No,Sun,Dinner,4
10.59,1.61,Yes,Sat,Dinner,2
10.63,2.0,Yes,Sat,Dinner,2
50.81,10.0,Yes,Sat,Dinner,3
15.81,3.16,Yes,Sat,Dinner,2
7.25,5.15,Yes,Sun,Dinner,2
31.85,3.18,Yes,Sun,Dinner,2
16.82,4.0,Yes,Sun,Dinner,2
32.9,3.11,Yes,Sun,Dinner,2
17.89,2.0,Yes,Sun,Dinner,2
14.48,2.0,Yes,Sun,Dinner,2
9.6,4.0,Yes,Sun,Dinner,2
34.63,3.55,Yes,Sun,Dinner,2
34.65,3.68,Yes,Sun,Dinner,4
23.33,5.65,Yes,Sun,Dinner,2
45.35,3.5,Yes,Sun,Dinner,3
23.17,6.5,Yes,Sun,Dinner,4
40.55,3.0,Yes,Sun,Dinner,2
20.69,5.0,No,Sun,Dinner,5
20.9,3.5,Yes,Sun,Dinner,3
30.46,2.0,Yes,Sun,Dinner,5
18.15,3.5,Yes,Sun,Dinner,3
23.1,4.0,Yes,Sun,Dinner,3
15.69,1.5,Yes,Sun,Dinner,2
19.81,4.19,Yes,Thur,Lunch,2
28.44,2.56,Yes,Thur,Lunch,2
15.48,2.02,Yes,Thur,Lunch,2
16.58,4.0,Yes,Thur,Lunch,2
7.56,1.44,No,Thur,Lunch,2
10.34,2.0,Yes,Thur,Lunch,2
43.11,5.0,Yes,Thur,Lunch,4
13.0,2.0,Yes,Thur,Lunch,2
13.51,2.0,Yes,Thur,Lunch,2
18.71,4.0,Yes,Thur,Lunch,3
12.74,2.01,Yes,Thur,Lunch,2
13.0,2.0,Yes,Thur,Lunch,2
16.4,2.5,Yes,Thur,Lunch,2
20.53,4.0,Yes,Thur,Lunch,4
16.47,3.23,Yes,Thur,Lunch,3
26.59,3.41,Yes,Sat,Dinner,3
38.73,3.0,Yes,Sat,Dinner,4
24.27,2.03,Yes,Sat,Dinner,2
12.76,2.23,Yes,Sat,Dinner,2
30.06,2.0,Yes,Sat,Dinner,3
25.89,5.16,Yes,Sat,Dinner,4
48.33,9.0,No,Sat,Dinner,4
13.27,2.5,Yes,Sat,Dinner,2
28.17,6.5,Yes,Sat,Dinner,3
12.9,1.1,Yes,Sat,Dinner,2
28.15,3.0,Yes,Sat,Dinner,5
11.59,1.5,Yes,Sat,Dinner,2
7.74,1.44,Yes,Sat,Dinner,2
30.14,3.09,Yes,Sat,Dinner,4
12.16,2.2,Yes,Fri,Lunch,2
13.42,3.48,Yes,Fri,Lunch,2
8.58,1.92,Yes,Fri,Lunch,1
15.98,3.0,No,Fri,Lunch,3
13.42,1.58,Yes,Fri,Lunch,2
16.27,2.5,Yes,Fri,Lunch,2
10.09,2.0,Yes,Fri,Lunch,2
20.45,3.0,No,Sat,Dinner,4
13.28,2.72,No,Sat,Dinner,2
22.12,2.88,Yes,Sat,Dinner,2
24.01,2.0,Yes,Sat,Dinner,4
15.69,3.0,Yes,Sat,Dinner,3
11.61,3.39,No,Sat,Dinner,2
10.77,1.47,No,Sat,Dinner,2
15.53,3.0,Yes,Sat,Dinner,2
10.07,1.25,No,Sat,Dinner,2
12.6,1.0,Yes,Sat,Dinner,2
32.83,1.17,Yes,Sat,Dinner,2
35.83,4.67,No,Sat,Dinner,3
29.03,5.92,No,Sat,Dinner,3
27.18,2.0,Yes,Sat,Dinner,2
22.67,2.0,Yes,Sat,Dinner,2
17.82,1.75,No,Sat,Dinner,2
18.78,3.0,No,Thur,Dinner,2

In [3]:

# Tips dataset
tips = pd.read_csv('examples/tips.csv')

# Create a new metric: the tip as a fraction of the total bill
tips['tip_pct'] = tips['tip'] / tips['total_bill']

tips.head()

Out[3]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

In [4]:

# Extract a smaller table to work with:
t2 = tips[['day', 'time', 'tip']]
t2.head()

Out[4]:

day time tip
0 Sun Dinner 1.01
1 Sun Dinner 1.66
2 Sun Dinner 3.50
3 Sun Dinner 3.31
4 Sun Dinner 3.61

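Example: the average tip for lunch and dinner (time) on each day of the week (day). The mean is pivot_table's default aggregation.

Implemented with pandas' built-in pivot_table function

In [10]:

# Pivot table
t2.pivot_table('tip', index='day', columns='time')

# Arguments: column to aggregate, row index column, column index column, fill value for missing data
t2.pivot_table('tip', index='day', columns='time', fill_value=0)  # fill missing values with 0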

Out[10]:

time Dinner Lunch
day
Fri 2.940000 2.382857
Sat 2.993103 0.000000
Sun 3.255132 0.000000
Thur 3.000000 2.767705

Implemented at a lower level with group aggregation (groupby) and reshaping (unstack)

In [15]:

t2.groupby(['day', 'time'])['tip'].mean().unstack().fillna(0)

Out[15]:

time Dinner Lunch
day
Fri 2.940000 2.382857
Sat 2.993103 0.000000
Sun 3.255132 0.000000
Thur 3.000000 2.767705
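The equivalence of the two approaches can be checked directly. Below is a minimal sketch on hypothetical toy data (the column names mirror t2, but the values are made up): it builds the mean table both ways and compares the resulting values.

```python
import pandas as pd

# Hypothetical toy data with the same columns as t2.
df = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Fri', 'Sat', 'Sat'],
    'time': ['Lunch', 'Dinner', 'Dinner', 'Dinner', 'Dinner'],
    'tip': [2.5, 3.0, 5.0, 1.0, 2.0],
})

# High level: pivot_table aggregates with mean by default.
via_pivot = df.pivot_table('tip', index='day', columns='time', fill_value=0)

# Low level: group by both keys, average, rotate time onto the columns.
via_groupby = df.groupby(['day', 'time'])['tip'].mean().unstack().fillna(0)

print(via_pivot)
print(via_groupby)
print((via_pivot.values == via_groupby.values).all())
```

Sat has no Lunch rows in this toy data, so that cell is missing after unstack and becomes 0 in both tables.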

Compute group means by day and smoker, placing both day and smoker on the row index

In [16]:

tips.head()

Out[16]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

In [17]:

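# Implemented with pivot_table
# (depending on the pandas version, the non-numeric time column may need to be dropped first)
tips.pivot_table(index=['day', 'smoker'])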

Out[17]:

size tip tip_pct total_bill
day smoker
Fri No 2.250000 2.812500 0.151650 18.420000
Yes 2.066667 2.714000 0.174783 16.813333
Sat No 2.555556 3.102889 0.158048 19.661778
Yes 2.476190 2.875476 0.147906 21.276667
Sun No 2.929825 3.167895 0.160113 20.506667
Yes 2.578947 3.516842 0.187250 24.120000
Thur No 2.488889 2.673778 0.160298 17.113111
Yes 2.352941 3.030000 0.163863 19.190588

In [20]:

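# Implemented with group aggregation
# numeric_only=True skips the non-numeric time column, which cannot be averaged
tips.groupby(['day', 'smoker']).mean(numeric_only=True)
tips.groupby(['day', 'smoker']).mean(numeric_only=True).sort_index(axis=1)  # sort by column labels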

Out[20]:

size tip tip_pct total_bill
day smoker
Fri No 2.250000 2.812500 0.151650 18.420000
Yes 2.066667 2.714000 0.174783 16.813333
Sat No 2.555556 3.102889 0.158048 19.661778
Yes 2.476190 2.875476 0.147906 21.276667
Sun No 2.929825 3.167895 0.160113 20.506667
Yes 2.578947 3.516842 0.187250 24.120000
Thur No 2.488889 2.673778 0.160298 17.113111
Yes 2.352941 3.030000 0.163863 19.190588
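The role of sort_index(axis=1) is easier to see on a small table. The sketch below uses hypothetical toy data with one non-numeric column that is not a grouping key: groupby keeps the original column order, while pivot_table returns its value columns sorted by label, so sorting the groupby result lines the two up.

```python
import pandas as pd

# Hypothetical toy data; 'time' is non-numeric and is not a grouping key.
df = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat'],
    'smoker': ['No', 'Yes', 'No', 'No'],
    'time': ['Lunch', 'Dinner', 'Dinner', 'Dinner'],
    'tip': [1.0, 2.0, 3.0, 5.0],
    'size': [2, 2, 4, 3],
})

# pivot_table with the numeric columns named explicitly.
via_pivot = df.pivot_table(['size', 'tip'], index=['day', 'smoker'])

# groupby: average only the numeric columns, then sort the column labels.
unsorted = df.groupby(['day', 'smoker']).mean(numeric_only=True)
via_groupby = unsorted.sort_index(axis=1)

print(list(unsorted.columns))     # original order: tip before size
print(list(via_groupby.columns))  # sorted order: size before tip
print(via_pivot)
```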

Suppose we only want to aggregate tip_pct and size, and also want to group by time, with smoker on the column index and time and day on the row index

In [22]:

tips.head()

Out[22]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

In [24]:

# Implemented with pivot_table
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker')

Out[24]:


[Image 1: output table of the cell above (image not preserved)]

In [27]:

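# Implemented with group aggregation
# using groupby and unstack:
# a lower-level, more basic approach that shows step by step how a pivot table is built

# groupby receives all the keys; unstack then moves the last key (smoker) onto the column index
# (multiple columns must be selected with a list, i.e. double brackets)
tips.groupby(['time', 'day', 'smoker'])[['size', 'tip_pct']].mean().unstack()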

Out[27]:


[Image 2: output table of the cell above (image not preserved)]

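Other pivot_table parameters

Pass margins=True to add subtotals
This adds a row and a column labelled All, whose values are the group statistics for all the data within a single tier
The All values are means: the All column ignores the smoker/non-smoker split, and the All row ignores both levels of the row grouping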

In [29]:

t2.pivot_table('tip', index='day', columns='time')

Out[29]:

time Dinner Lunch
day
Fri 2.940000 2.382857
Sat 2.993103 NaN
Sun 3.255132 NaN
Thur 3.000000 2.767705

In [30]:

t2.pivot_table('tip', index='day', columns='time', margins=True)  # subtotals (means)

Out[30]:

time Dinner Lunch All
day
Fri 2.940000 2.382857 2.734737
Sat 2.993103 NaN 2.993103
Sun 3.255132 NaN 3.255132
Thur 3.000000 2.767705 2.771452
All 3.102670 2.728088 2.998279

In [31]:

(2.940000 + 2.382857) / 2  # does not match the All value for Fri (2.734737)

Out[31]:

2.6614285

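The All statistic is the mean over all the underlying data rows in that row or column, not the mean of the cell values displayed in the pivot table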

In [32]:

(1+2+3+4+5)/5

Out[32]:

3.0

In [33]:

(1+2+3)/3

Out[33]:

2.0

In [34]:

(4+5)/2

Out[34]:

4.5

In [35]:

(2 + 4.5) / 2

Out[35]:

3.25
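The same arithmetic can be reproduced with pivot_table itself. This is a minimal sketch on hypothetical toy data using the numbers 1 to 5 from the cells above, split into a group of three and a group of two: the All row reports the mean of all five values, not the mean of the two group means.

```python
import pandas as pd

# Hypothetical toy data: group a holds 1, 2, 3 and group b holds 4, 5.
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b'],
    'value': [1, 2, 3, 4, 5],
})

table = df.pivot_table('value', index='group', margins=True)

group_means = table.loc[['a', 'b'], 'value']  # 2.0 and 4.5
mean_of_means = group_means.mean()            # (2.0 + 4.5) / 2
overall = table.loc['All', 'value']           # mean of all five rows

print(table)
print(mean_of_means, overall)
```

The two numbers differ because the groups have different sizes. The All value is in effect a mean weighted by group size.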

The mean tip over all Dinner rows

In [36]:

(2.940000 + 2.993103 + 3.255132 + 3.000000) / 4  # does not match the All value for Dinner (3.102670)

Out[36]:

3.04705875

In [37]:

t2.head()

Out[37]:

day time tip
0 Sun Dinner 1.01
1 Sun Dinner 1.66
2 Sun Dinner 3.50
3 Sun Dinner 3.31
4 Sun Dinner 3.61

In [42]:

# The mean tip over all Dinner rows equals the subtotal
t2[t2['time'] == 'Dinner']['tip'].mean()

Out[42]:

3.102670454545454

In [43]:

# The earlier pivot table, with subtotals added
tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True)

Out[43]:

[Image 3: output table of the cell above (image not preserved)]

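The default aggregation function of a pivot table is mean

To use a different aggregation function, pass it to aggfunc (as a function object or as a function name string)

For example, count or len gives a crosstab of group sizes (counts or frequencies)

Accepted values are typically a function name string, a Python function, or a NumPy function: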

len
'count'
np.max
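The options above are not interchangeable when data is missing. The sketch below, on hypothetical toy data with one NaN, shows that len counts every row in a group while 'count' counts only non-missing values. It passes the string 'max' where np.max could also be used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing tip.
df = pd.DataFrame({
    'day': ['Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
    'tip': [1.0, np.nan, 2.0, 3.0, 4.0],
})

n_rows = df.pivot_table('tip', index='day', aggfunc=len)       # rows per group, NaN included
n_valid = df.pivot_table('tip', index='day', aggfunc='count')  # non-missing values per group
largest = df.pivot_table('tip', index='day', aggfunc='max')    # largest tip per group

print(n_rows)
print(n_valid)
print(largest)
```

For Fri, len reports 2 while 'count' reports 1, because one of the two Fri tips is missing.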

In [48]:

tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True, aggfunc=len)

Out[48]:

[Image 4: output table of the cell above (image not preserved)]

The fill_value parameter fills in missing values (NA)

In [50]:

f = tips.pivot_table(['tip_pct', 'size'], index=['time', 'day'], columns='smoker', margins=True, aggfunc=len, fill_value=0)
f

Out[50]:

[Image 5: output table of the cell above (image not preserved)]

In [52]:

# astype changes the dtype, e.g. int or float (the np.int alias has been removed from recent NumPy releases)
f.astype(int)  # convert to integer

Out[52]:

[Image 6: output table of the cell above (image not preserved)]

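Crosstabs: crosstab

A cross-tabulation (crosstab for short) is a special pivot table for computing group frequencies

A pivot table is a grouped table, aggregated with mean, that has been rotated
A crosstab is a pivot table whose aggregation function is a count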

In [53]:

data = pd.DataFrame(
    {
        'Sample': np.arange(1,11),
        'Nationality': ["USA", "Japan", "USA", "Japan", "Japan", "Japan", "USA", "USA", "Japan", "USA"],
        'Handedness': ["Right-handed", "Left-handed", "Right-handed", "Right-handed", "Left-handed", "Right-handed", "Right-handed", "Left-handed", "Right-handed", "Right-handed"],
    },
    columns=['Sample', 'Nationality', 'Handedness']
)
data

Out[53]:

Sample Nationality Handedness
0 1 USA Right-handed
1 2 Japan Left-handed
2 3 USA Right-handed
3 4 Japan Right-handed
4 5 Japan Left-handed
5 6 Japan Right-handed
6 7 USA Right-handed
7 8 USA Left-handed
8 9 Japan Right-handed
9 10 USA Right-handed

In [56]:

# Select a single column (three equivalent ways)
data['Nationality']
data.loc[:, 'Nationality']
data.Nationality

Out[56]:

0      USA
1    Japan
2      USA
3    Japan
4    Japan
5    Japan
6      USA
7      USA
8    Japan
9      USA
Name: Nationality, dtype: object

In [57]:

# Crosstab
pd.crosstab(data.Nationality, data.Handedness)  # one column for the row index, one for the column index

Out[57]:

Handedness Left-handed Right-handed
Nationality
Japan 2 3
USA 1 4

In [58]:

# Add subtotals
pd.crosstab(data.Nationality, data.Handedness, margins=True)

Out[58]:

Handedness Left-handed Right-handed All
Nationality
Japan 2 3 5
USA 1 4 5
All 3 7 10

In [64]:

# Reproducing the crosstab with a pivot table
data.pivot_table('Sample', index='Nationality', columns='Handedness', aggfunc=len, margins=True)  # variant 1
pd.pivot_table(data, index='Nationality', columns='Handedness', aggfunc=len, margins=True)['Sample']  # variant 2

Out[64]:

Handedness Left-handed Right-handed All
Nationality
Japan 2 3 5
USA 1 4 5
All 3 7 10

In [67]:

# Low level: reproducing the crosstab with group aggregation and axis rotation
y = data.groupby(['Nationality', 'Handedness']).size().unstack()
y

Out[67]:

Handedness Left-handed Right-handed
Nationality
Japan 2 3
USA 1 4

In [69]:

# Add a subtotal row
y.sum()

y.loc['All'] = y.sum()
y

Out[69]:

Handedness Left-handed Right-handed
Nationality
Japan 2 3
USA 1 4
All 3 7

In [71]:

# Add a subtotal column
y.sum(axis=1)

y['All'] = y.sum(axis=1)
y

Out[71]:

Handedness Left-handed Right-handed All
Nationality
Japan 2 3 5
USA 1 4 5
All 3 7 10
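The two manual steps above can be wrapped in a small helper. This is a sketch only: add_margins is a hypothetical name, not a pandas function, and the toy data is made up. It adds the total row first and the total column second, so the bottom-right cell ends up holding the grand total, and the result is compared with crosstab's margins=True.

```python
import pandas as pd

def add_margins(counts: pd.DataFrame, name: str = 'All') -> pd.DataFrame:
    """Return a copy of a count table with a total row and a total column."""
    out = counts.copy()
    out.loc[name] = out.sum()    # column totals become a new row
    out[name] = out.sum(axis=1)  # row totals (including the new row) become a new column
    return out

# Hypothetical toy data.
df = pd.DataFrame({
    'Nationality': ['USA', 'Japan', 'USA', 'Japan', 'Japan'],
    'Handedness': ['Right-handed', 'Left-handed', 'Right-handed',
                   'Right-handed', 'Left-handed'],
})

# Count table built by hand; fill_value=0 covers the missing USA/Left-handed pair.
counts = df.groupby(['Nationality', 'Handedness']).size().unstack(fill_value=0)
with_totals = add_margins(counts)

# Reference result from the high-level function.
expected = pd.crosstab(df['Nationality'], df['Handedness'], margins=True)

print(with_totals)
print(expected)
```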

Example: crosstabs on the tips data

In [72]:

tips.head()

Out[72]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

Count the smokers and non-smokers for each meal time and each day of the week

Row index: time, day

Column index: smoker

Implemented with crosstab

In [75]:

pd.crosstab([tips.time, tips.day], tips.smoker, margins=True)

Out[75]:

[Image 7: output table of the cell above (image not preserved)]

In [77]:

pd.crosstab(tips.time, [tips.day, tips.smoker], margins=True)

Out[77]:

[Image 8: output table of the cell above (image not preserved)]

Implemented with pivot_table

In [79]:

tips.head()

Out[79]:

total_bill tip smoker day time size tip_pct
0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808

In [80]:

pd.pivot_table(tips, index=['time', 'day'], columns='smoker', aggfunc=len, margins=True, fill_value=0)['size']

Out[80]:

[Image 9: output table of the cell above (image not preserved)]

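Implemented with group aggregation and axis rotation

In [84]:

# astype(int) replaces astype(np.int), since the np.int alias has been removed from recent NumPy releases
tips.groupby(['time', 'day', 'smoker']).size().unstack().fillna(0).astype(int)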

Out[84]:


[Image 10: output table of the cell above (image not preserved)]
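As a possible simplification of the last cell, unstack accepts a fill_value argument, which fills the missing combinations during the rotation, so the counts never become floats and the fillna and astype steps are not needed. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: there is no (Lunch, Yes) combination.
df = pd.DataFrame({
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner'],
    'smoker': ['No', 'No', 'No', 'Yes'],
})

# Missing combinations are filled with 0 while rotating smoker onto the columns.
counts = df.groupby(['time', 'smoker']).size().unstack(fill_value=0)

print(counts)
print(counts.dtypes)
```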
