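Aggregating data
pivot_table()
Turns the values of chosen columns into the row index and column index of the result, and can aggregate the data at the same time.
(I tend to think of pivot_table as a two-dimensional groupby: the grouping keys are placed on index and columns.)
(pivot() only reshapes column values into a row index and a column index. It cannot aggregate, and it fails when an index/columns combination appears more than once.)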
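The difference between pivot() and pivot_table() on duplicated entries can be sketched as follows. The `sales` frame is hypothetical toy data, not the dataset used in these notes.

```python
import pandas as pd

# Hypothetical toy data: the (Rep, Product) pair ("Craig", "CPU") occurs twice.
sales = pd.DataFrame({
    "Rep": ["Craig", "Craig", "John"],
    "Product": ["CPU", "CPU", "CPU"],
    "Price": [30000, 35000, 65000],
})

# pivot() only reshapes, so a duplicated index/columns pair raises ValueError.
try:
    sales.pivot(index="Rep", columns="Product", values="Price")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (mean by default).
table = sales.pivot_table(index="Rep", columns="Product", values="Price")

print(pivot_failed)
print(table)
```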
pivot_table() is available both as a top-level function (pd.pivot_table) and as a DataFrame method (df.pivot_table).
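A minimal sketch of the two call styles, using hypothetical toy data; both calls are expected to give the same result.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "Rep": ["Craig", "Craig", "John"],
    "Price": [30000, 10000, 35000],
})

# Method form: the DataFrame is the object the method is called on.
as_method = df.pivot_table(index="Rep", values="Price", aggfunc="sum")

# Top-level form: the DataFrame is passed as the first argument (data).
as_function = pd.pivot_table(df, index="Rep", values="Price", aggfunc="sum")

print(as_method.equals(as_function))
```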
"My general rule of thumb is that once you use multiple groupby you should evaluate whether a pivot table is a useful approach." (Chris Moffitt)
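To make the relation to groupby concrete, here is a small sketch on hypothetical toy data: a two-key groupby followed by unstack() produces the same numbers as one pivot_table() call with one key on index and the other on columns.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred", "Fred", "Fred"],
    "Product": ["CPU", "Software", "CPU", "CPU", "Monitor"],
    "Price": [30000, 10000, 65000, 35000, 5000],
})

# Group by two keys, then move the second key to the columns.
by_groupby = df.groupby(["Manager", "Product"])["Price"].sum().unstack("Product")

# The same two-dimensional grouping expressed as a pivot table.
by_pivot = df.pivot_table(index="Manager", columns="Product",
                          values="Price", aggfunc="sum")

print(by_groupby)
print(by_pivot)
```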
df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
Parameter | Type | Description |
data | DataFrame | Used by pd.pivot_table only: the DataFrame to operate on |
values | column | Optional. The column(s) to aggregate |
index | column, Grouper, array, or list of the previous | Row grouping keys: column names or other grouping keys, used as the row index of the resulting DataFrame |
columns | column, Grouper, array, or list of the previous | Column grouping keys: column names or other grouping keys, used as the column index of the resulting DataFrame |
aggfunc | function, list of functions, dict | Default 'mean'. Aggregation function or list of functions. If a list is passed, the function names appear as a level of the columns of the resulting DataFrame |
fill_value | scalar | Default None. Value used to replace missing values in the result |
dropna | boolean | Default True. Columns whose values are all NaN are dropped; with False they are kept |
margins | boolean | Default False. When True, row/column totals are added |
margins_name | string | Default 'All'. Name of the margins row/column when margins=True |
>>> df.head()
Account Name Rep Manager Product Quantity Price Status
0 714466 Trantow Craig Henley CPU 1 30000 presented
1 714466 Trantow Craig Henley Software 1 10000 presented
2 714466 Trantow Craig Henley Maintenance 2 5000 pending
3 737550 Fritsch Craig Henley CPU 1 35000 declined
4 146832 Kiehn Daniel Henley CPU 2 65000 won
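#Define the "Status" column as a category and set the category order
#so that the desired order is kept throughout the analysis
#set_categories returns a new Series, so assign it back (the inplace argument was removed in pandas 2.0)
>>> df["Status"] = df["Status"].astype("category")
>>> df["Status"] = df["Status"].cat.set_categories(["won","pending","presented","declined"])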
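A minimal sketch of how the category order carries over into a pivot table, on hypothetical toy data. Passing `observed=True` is an assumption added here so that the call behaves the same across pandas versions; it is not part of the original notes.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "Status": ["presented", "won", "declined", "pending", "won"],
    "Price": [30000, 65000, 35000, 5000, 10000],
})

order = ["won", "pending", "presented", "declined"]
# Convert to category and fix the category order in one assignment.
df["Status"] = df["Status"].astype("category").cat.set_categories(order)

# Groups are sorted by category order rather than alphabetically.
table = df.pivot_table(index="Status", values="Price",
                       aggfunc="sum", observed=True)

print(list(table.index))
```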
-----------------------------------------------------------------------------
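#Set a single row index
#index=["Name"]
#Only the columns that can be aggregated are kept; the other columns are dropped
>>> df.pivot_table(index=["Name"])
#or, listing the aggregated columns explicitly (the safer form, since it does not rely on non-numeric columns being dropped):
>>> df.pivot_table(index=["Name"],values=["Account","Price","Quantity"])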
Account Price Quantity
Name
Barton 740150 35000 1.000000
Fritsch 737550 35000 1.000000
Herman 141962 65000 2.000000
Jerde 412290 5000 2.000000
Kassulke 307599 7000 3.000000
#Set several row indexes
#index=["Name","Rep","Manager"]
>>> df.pivot_table(index=["Name","Rep","Manager"])
#or, listing the aggregated columns explicitly:
>>> df.pivot_table(index=["Name","Rep","Manager"],
                   values=["Account","Price","Quantity"])
Account Price Quantity
Name Rep Manager
Barton John Debra 740150 35000 1.000000
Fritsch Craig Debra 737550 35000 1.000000
Herman Cedric Fred 141962 65000 2.000000
Jerde John Debra 412290 5000 2.000000
Kassulke Wendy Fred 307599 7000 3.000000
#Set a row index that is more meaningful in practice
>>> df.pivot_table(index=["Manager","Rep"])
#or, listing the aggregated columns explicitly:
>>> df.pivot_table(index=["Manager","Rep"],
                   values=["Account","Price","Quantity"])
Account Price Quantity
Manager Rep
Debra Craig 720237.0 20000.000000 1.250000
Daniel 194874.0 38333.333333 1.666667
John 576220.0 20000.000000 1.500000
Fred Cedric 196016.5 27500.000000 1.250000
Wendy 614061.5 44250.000000 3.000000
-----------------------------------------------------------------------------
#Prepare for aggregation: keep only the "Price" column, which is the one to aggregate
#The mean of "Price" is computed automatically
# aggfunc='mean' is the default
>>> df.pivot_table(index=["Manager","Rep"],values=["Price"])
Price
Manager Rep
Debra Craig 20000
Daniel 38333
John 20000
Fred Cedric 27500
Wendy 44250
#Apply a sum to the "Price" column
#instead of the default mean
# aggfunc=np.sum
>>> df.pivot_table(index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)
Price
Manager Rep
Debra Craig 80000
Daniel 115000
John 40000
Fred Cedric 110000
Wendy 177000
#Several aggregations can be applied to a column at once
#Compute both the mean and the count of the "Price" column
# aggfunc=[np.mean,len]
>>> df.pivot_table(index=["Manager","Rep"],
values=["Price"],aggfunc=[np.mean,len])
mean len
Price Price
Manager Rep
Debra Craig 20000 4
Daniel 38333 3
John 20000 2
Fred Cedric 27500 4
Wendy 44250 4
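A small sketch of what a list of aggregation functions does to the column index, on hypothetical toy data. The string names "mean" and "count" are used here in place of np.mean and len from the transcript above; that substitution is an assumption made for this example.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred"],
    "Price": [30000, 10000, 65000],
})

# A list of functions adds the function name as the outer column level.
table = df.pivot_table(index="Manager", values=["Price"],
                       aggfunc=["mean", "count"])

print(table.columns.tolist())
print(table)
```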
-----------------------------------------------------------------------------
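#The values in the "Price" column come from several kinds of "Product"
#You can group by "Product"; placing "Product" on index or on columns gives different views of the data
#Placing "Product" on columns, columns=["Product"] (a horizontal view; the result is wide)
#Placing "Product" on index, index=["Product"] (a vertical view; the result is long)
#★First, columns=["Product"]: move the grouping key to the columns
#Setting columns produces a multi-level column index
# columns becomes the innermost column level
# values becomes an outer column level
#fill_value=0 replaces the resulting missing values (NaN) with 0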
>>> df.pivot_table(index=["Manager","Rep"],
values=["Price"],
columns=["Product"],aggfunc=[np.sum],fill_value=0)
sum
Price
Product CPU Maintenance Monitor Software
Manager Rep
Debra Craig 65000 5000 0 10000
Daniel 105000 0 0 10000
John 35000 5000 0 0
Fred Cedric 95000 5000 0 10000
Wendy 165000 7000 5000 0
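The column levels and the effect of fill_value can be checked on a small example. This is a sketch on hypothetical toy data, using the string "sum" rather than np.sum.

```python
import pandas as pd

# Hypothetical toy data: Debra has no Monitor sale, Fred has no Software sale.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred", "Fred"],
    "Product": ["CPU", "Software", "CPU", "Monitor"],
    "Price": [30000, 10000, 65000, 5000],
})

table = df.pivot_table(index="Manager", columns="Product",
                       values=["Price"], aggfunc="sum", fill_value=0)

# Outer level comes from values, innermost level from columns.
print(table.columns.tolist())
# The missing (Debra, Monitor) combination is filled with 0 instead of NaN.
print(table.loc["Debra", ("Price", "Monitor")])
```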
#Add the "Quantity" column
#After adding "Quantity", columns is still the innermost column level
#Column level order, outer to inner: 'values' >> 'columns'
>>> df.pivot_table(index=["Manager","Rep"],
values=["Price","Quantity"],
columns=["Product"],aggfunc=[np.sum],fill_value=0)
sum
Price Quantity
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Rep
Debra Craig 65000 5000 0 10000 2 2 0 1
Daniel 105000 0 0 10000 4 0 0 1
John 35000 5000 0 0 1 2 0 0
Fred Cedric 95000 5000 0 10000 3 1 0 1
Wendy 165000 7000 5000 0 7 3 2 0
#★Next, index=["Product"]: move the grouping key to the rows
>>> df.pivot_table(index=["Manager","Rep","Product"],
values=["Price","Quantity"],aggfunc=[np.sum],fill_value=0,dropna=True)
sum
Price Quantity
Manager Rep Product
Debra Craig CPU 65000 2
Maintenance 5000 2
Software 10000 1
Daniel CPU 105000 4
Software 10000 1
John CPU 35000 1
Maintenance 5000 2
Fred Cedric CPU 95000 3
Maintenance 5000 1
Software 10000 1
Wendy CPU 165000 7
Maintenance 7000 3
Monitor 5000 2
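The wide and long views hold the same aggregated numbers in different shapes. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred", "Fred", "Fred"],
    "Product": ["CPU", "Software", "CPU", "CPU", "Monitor"],
    "Price": [30000, 10000, 65000, 35000, 5000],
})

# "Product" on columns: one row per Manager, one column per Product (wide).
wide = df.pivot_table(index="Manager", columns="Product",
                      values="Price", aggfunc="sum", fill_value=0)

# "Product" on index: one row per (Manager, Product) pair that occurs (long).
long = df.pivot_table(index=["Manager", "Product"],
                      values="Price", aggfunc="sum")

print(wide.shape, long.shape)
```

Note that the long view only lists combinations that actually occur, so it needs no fill value, while the wide view has a cell for every Manager/Product pair.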
-----------------------------------------------------------------------------
#Pass a dict as aggfunc so that different values columns can use different functions
#the dict keys are the values columns
#the dict values are the functions
#aggfunc={"Quantity":len,"Price":np.sum}
>>> df.pivot_table(index=["Manager","Status"],
columns=["Product"],
values=["Quantity","Price"],
aggfunc={"Quantity":len,"Price":np.sum},fill_value=0)
Price Quantity
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra declined 70000 0 0 0 2 0 0 0
pending 40000 10000 0 0 1 2 0 0
presented 30000 0 0 20000 1 0 0 2
won 65000 0 0 0 1 0 0 0
Fred declined 65000 0 0 0 1 0 0 0
pending 0 5000 0 0 0 1 0 0
presented 30000 0 5000 10000 1 0 1 1
won 165000 7000 0 0 2 1 0 0
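A reduced sketch of the dict form on hypothetical toy data, without a columns key so the result is easy to read. The strings "count" and "sum" stand in for len and np.sum; that substitution is an assumption made for this example.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred", "Fred", "Fred"],
    "Quantity": [1, 2, 1, 3, 2],
    "Price": [30000, 10000, 65000, 35000, 5000],
})

# Each values column gets its own aggregation function.
table = df.pivot_table(index="Manager",
                       values=["Quantity", "Price"],
                       aggfunc={"Quantity": "count", "Price": "sum"})

print(table)
```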
#aggfunc={"Quantity":len,"Price":[np.sum,np.mean]}
>>> df.pivot_table(index=["Manager","Status"],
columns=["Product"],
values=["Quantity","Price"],
aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0)
Price Quantity
mean sum len
Product CPU Maintenance Monitor Software CPU Maintenance Monitor Software CPU Maintenance Monitor Software
Manager Status
Debra Henley declined 35000 0 0 0 70000 0 0 0 2 0 0 0
pending 40000 5000 0 0 40000 10000 0 0 1 2 0 0
presented 30000 0 0 10000 30000 0 0 20000 1 0 0 2
won 65000 0 0 0 65000 0 0 0 1 0 0 0
Fred Anderson declined 65000 0 0 0 65000 0 0 0 1 0 0 0
pending 0 5000 0 0 0 5000 0 0 0 1 0 0
presented 30000 0 5000 10000 30000 0 5000 10000 1 0 1 1
won 82500 7000 0 0 165000 7000 0 0 2 1 0 0
-----------------------------------------------------------------------------
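#Add totals: the row/column totals are computed automatically
#margins=True
#The margins depend on the grouping (that is, on what index / columns contain)
#This makes sense: a total presupposes groups; without grouping there is nothing to total
#With index, a margin appears at the bottom (one total row for the row groups)
#With columns, a margin appears at the far right (one total column for the column groups)
#★So with index only and no columns, there is just the one margin at the bottom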
>>> df.pivot_table(index="Manager",
values="Price",
margins=True,margins_name="New Sum",
aggfunc=np.sum,fill_value=0)
Price
Manager
Debra Henley 235000
Fred Anderson 287000
New Sum 522000
#★After adding columns, the margin at the far right appears
>>> df.pivot_table(index=["Manager"],
values="Price",
columns="Product",
margins=True,margins_name="New Sum",
aggfunc=np.sum,fill_value=0)
Product CPU Maintenance Monitor Software New Sum
Manager
Debra Henley 205000 10000 0 20000 235000
Fred Anderson 260000 12000 5000 10000 287000
New Sum 465000 22000 5000 30000 522000
#★Each values column also gets its own margin at the far right
#provided columns is set (without columns there is only the bottom margin)
>>> df.pivot_table(index=["Manager"],
values=["Price","Quantity"],
columns="Product",
margins=True,margins_name="New Sum",
aggfunc=np.sum,fill_value=0)
Price Quantity
Product CPU Maintenance Monitor Software New Sum CPU Maintenance Monitor Software New Sum
Manager
Debra Henley 205000 10000 0 20000 235000 7 4 0 2 13
Fred Anderson 260000 12000 5000 10000 287000 10 4 2 1 17
New Sum 465000 22000 5000 30000 522000 17 8 2 3 30
#★How the margin is computed
#It is not simply a sum; it follows the same rule as aggfunc
#If aggfunc=np.mean, the margin is also a mean
#If aggfunc=np.sum, the margin is also a sum
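A small check of the margin rule on hypothetical toy data: with a mean, the margin is the mean over all underlying rows, which differs from the mean of the group means when the groups have different sizes.

```python
import pandas as pd

# Hypothetical toy data: Debra has three rows, Fred has one.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Debra", "Fred"],
    "Price": [10000, 20000, 30000, 60000],
})

table = df.pivot_table(index="Manager", values="Price",
                       aggfunc="mean", margins=True)

# Group means are 20000 (Debra) and 60000 (Fred).
# The "All" row is 30000, the mean of the four rows,
# not 40000, the mean of the two group means.
print(table)
```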
-----------------------------------------------------------------------------
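#★Note the order of the "Status" rows: with the category order set earlier they should appear as won, pending, presented, declined; the alphabetical order in the output below is what appears when that category order was not applied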
>>> df.pivot_table(index=["Manager","Status"],
values=["Price"],
aggfunc=[np.sum],fill_value=0,margins=True)
sum
Price
Manager Status
Debra Henley declined 70000
pending 50000
presented 50000
won 65000
Fred Anderson declined 65000
pending 5000
presented 45000
won 172000
All 522000
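Filtering the pivot table further
The resulting DataFrame can be stored in a variable (for example table), so that it can then be analyzed with the usual DataFrame functions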
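One way to do this, sketched on hypothetical toy data, is to keep the pivot table in a variable and filter it with DataFrame.query(), which can refer to index level names. The variable names are illustrative.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
df = pd.DataFrame({
    "Manager": ["Debra", "Debra", "Fred", "Fred"],
    "Status": ["won", "pending", "won", "declined"],
    "Price": [65000, 5000, 30000, 35000],
})

# Store the pivot table so it can be reused.
table = df.pivot_table(index=["Manager", "Status"],
                       values="Price", aggfunc="sum")

# Filter on one index level.
debra = table.query('Manager == "Debra"')

# Filter on several values of another index level.
won_or_pending = table.query('Status in ["won", "pending"]')

print(debra)
print(won_or_pending)
```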
References for this article:
A Detailed Guide to Pandas Pivot Tables (pivot_table)
Pandas Pivot Table Explained (pbpython.com)
Pandas Pivot Table Explained (jupyter.org)