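Most people have probably used pivot tables in Excel. pandas provides a similar function called pivot_table. It is extremely useful, but I often struggle to remember how to use the syntax to format the output the way I need. This article focuses on the pandas pivot_table function and how to use it for data analysis.
As an added bonus, I have created a simple cheat sheet that summarizes pivot_table. You can find it at the end of this post, and I hope it serves as a useful reference.
One challenge with the pandas pivot_table is making sure you understand your data and the questions you are trying to answer with the pivot table. It is a seemingly simple function, but it can produce very powerful analysis very quickly.
Many companies have a CRM tool or other software that sales uses to track its process. While these tools may have useful features for analyzing the data, inevitably someone will export the data to Excel and use a pivot table to summarize it.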
A pandas pivot table can be a good alternative to that workflow.
Let's set up our environment first, then read the sales data (sales-funnel.xlsx) into a DataFrame:
import pandas as pd
import numpy as np
df = pd.read_excel("./sales-funnel.xlsx")
 | Account | Name | Rep | Manager | Product | Quantity | Price | Status
0 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | CPU | 1 | 30000 | presented
1 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Software | 1 | 10000 | presented
2 | 714466 | Trantow-Barrows | Craig Booker | Debra Henley | Maintenance | 2 | 5000 | pending
3 | 737550 | Fritsch, Russel and Anderson | Craig Booker | Debra Henley | CPU | 1 | 35000 | declined
4 | 146832 | Kiehn-Spinka | Daniel Hilton | Debra Henley | CPU | 2 | 65000 | won
df["Status"] = df["Status"].astype("category")
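Converting Status to a category also lets us control the order in which the statuses are displayed. Here is a minimal sketch of that idea on a standalone Series. The order won, pending, presented, declined is an assumption read off the tables later in this article, and `pd.CategoricalDtype` is just one way to set it.

```python
import pandas as pd

# Status values from the five sample rows shown above.
status = pd.Series(["presented", "presented", "pending", "declined", "won"])

# Assumed display order, inferred from the tables later in the article.
order = ["won", "pending", "presented", "declined"]
status = status.astype(pd.CategoricalDtype(categories=order, ordered=True))

# Sorting now follows the category order instead of alphabetical order.
print(status.sort_values().tolist())
```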
The simplest pivot table must have a DataFrame and an index. In this case, let's use Name as our index.
Name | Account | Price | Quantity
Barton LLC | 740150.0 | 35000.0 | 1.000000
Fritsch, Russel and Anderson | 737550.0 | 35000.0 | 1.000000
Herman LLC | 141962.0 | 65000.0 | 2.000000
Jerde-Hilpert | 412290.0 | 5000.0 | 2.000000
Kassulke, Ondricka and Metz | 307599.0 | 7000.0 | 3.000000
Keeling LLC | 688981.0 | 100000.0 | 5.000000
Kiehn-Spinka | 146832.0 | 65000.0 | 2.000000
Koepp Ltd | 729833.0 | 35000.0 | 2.000000
Kulas Inc | 218895.0 | 25000.0 | 1.500000
Purdy-Kunde | 163416.0 | 30000.0 | 1.000000
Stokes LLC | 239344.0 | 7500.0 | 1.000000
Trantow-Barrows | 714466.0 | 15000.0 | 1.333333
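A call that produces a table of this shape can be sketched as follows. It uses only the five sample rows from the head of the data shown earlier, not the full file, so only the accounts in that sample appear. Passing `values` explicitly is my own choice, so that only numeric columns are averaged.

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Name": ["Trantow-Barrows", "Trantow-Barrows", "Trantow-Barrows",
             "Fritsch, Russel and Anderson", "Kiehn-Spinka"],
    "Account": [714466, 714466, 714466, 737550, 146832],
    "Quantity": [1, 1, 2, 1, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# The default aggregation is the mean of each value column per index entry.
table = pd.pivot_table(df, index=["Name"], values=["Account", "Price", "Quantity"])
print(table)
```

For Trantow-Barrows the sample already contains all three rows, so its averages match the table above.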
You can have multiple indexes as well. In fact, most of the pivot_table arguments can take multiple values via a list.
Name | Rep | Manager | Account | Price | Quantity
Barton LLC | John Smith | Debra Henley | 740150.0 | 35000.0 | 1.000000
Fritsch, Russel and Anderson | Craig Booker | Debra Henley | 737550.0 | 35000.0 | 1.000000
Herman LLC | Cedric Moss | Fred Anderson | 141962.0 | 65000.0 | 2.000000
Jerde-Hilpert | John Smith | Debra Henley | 412290.0 | 5000.0 | 2.000000
Kassulke, Ondricka and Metz | Wendy Yule | Fred Anderson | 307599.0 | 7000.0 | 3.000000
Keeling LLC | Wendy Yule | Fred Anderson | 688981.0 | 100000.0 | 5.000000
Kiehn-Spinka | Daniel Hilton | Debra Henley | 146832.0 | 65000.0 | 2.000000
Koepp Ltd | Wendy Yule | Fred Anderson | 729833.0 | 35000.0 | 2.000000
Kulas Inc | Daniel Hilton | Debra Henley | 218895.0 | 25000.0 | 1.500000
Purdy-Kunde | Cedric Moss | Fred Anderson | 163416.0 | 30000.0 | 1.000000
Stokes LLC | Cedric Moss | Fred Anderson | 239344.0 | 7500.0 | 1.000000
Trantow-Barrows | Craig Booker | Debra Henley | 714466.0 | 15000.0 | 1.333333
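Passing a list to `index` is what creates the multi-level index. A small sketch, again assuming only the five sample rows from the head of the data and an explicit `values` list:

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Name": ["Trantow-Barrows", "Trantow-Barrows", "Trantow-Barrows",
             "Fritsch, Russel and Anderson", "Kiehn-Spinka"],
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Manager": ["Debra Henley"] * 5,
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# Each name in the list becomes one level of the resulting row index.
table = pd.pivot_table(df, index=["Name", "Rep", "Manager"], values=["Price"])
print(table.index.names)
print(table)
```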
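This is interesting but not particularly useful. What we probably want to do is look at the data by Manager and Rep. That is easy to do by changing the index.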
Manager | Rep | Account | Price | Quantity
Debra Henley | Craig Booker | 720237.0 | 20000.000000 | 1.250000
Debra Henley | Daniel Hilton | 194874.0 | 38333.333333 | 1.666667
Debra Henley | John Smith | 576220.0 | 20000.000000 | 1.500000
Fred Anderson | Cedric Moss | 196016.5 | 27500.000000 | 1.250000
Fred Anderson | Wendy Yule | 614061.5 | 44250.000000 | 3.000000
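One way to see what the pivot table is doing here is to compare it with an explicit groupby. The sketch below assumes the five sample rows from the head of the data; the groupby comparison is my own illustration rather than part of the original walkthrough.

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Quantity": [1, 1, 2, 1, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# Pivot by manager and rep; the default aggregation is the mean.
table = pd.pivot_table(df, index=["Manager", "Rep"], values=["Price", "Quantity"])

# The same summary written as a groupby followed by a mean.
grouped = df.groupby(["Manager", "Rep"])[["Price", "Quantity"]].mean()

print(table)
print(grouped)
```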
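You can see that the pivot table is smart enough to start aggregating the data and summarizing it by grouping the reps with their managers. Now we start to get a glimpse of what a pivot table can do for us.
For this purpose, the Account and Quantity columns aren't really useful. Let's remove them by explicitly defining the columns we care about with the values field.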
Manager | Rep | Price
Debra Henley | Craig Booker | 20000.000000
Debra Henley | Daniel Hilton | 38333.333333
Debra Henley | John Smith | 20000.000000
Fred Anderson | Cedric Moss | 27500.000000
Fred Anderson | Wendy Yule | 44250.000000
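Restricting the output to Price can be sketched like this, assuming the five sample rows from the head of the data. The extra Quantity column is there only to show that `values` leaves it out.

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Quantity": [1, 1, 2, 1, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# Only the columns named in values are aggregated; Quantity is ignored.
table = pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"])
print(table)
```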
The Price column is averaged by default, but we can compute a count or a sum instead. Adding them is simple with aggfunc and np.sum.
Manager | Rep | Price
Debra Henley | Craig Booker | 80000
Debra Henley | Daniel Hilton | 115000
Debra Henley | John Smith | 40000
Fred Anderson | Cedric Moss | 110000
Fred Anderson | Wendy Yule | 177000
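A sketch of the sum aggregation, assuming the five sample rows from the head of the data. I pass the string "sum" here instead of np.sum; that substitution is mine, and both are meant to select the same aggregation.

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# aggfunc replaces the default mean with a sum per (Manager, Rep) group.
table = pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"], aggfunc="sum")
print(table)
```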
aggfunc can take a list of functions. Let's try a mean using the np.mean function and len to get a count.
Manager | Rep | mean / Price | len / Price
Debra Henley | Craig Booker | 20000.000000 | 4
Debra Henley | Daniel Hilton | 38333.333333 | 3
Debra Henley | John Smith | 20000.000000 | 2
Fred Anderson | Cedric Moss | 27500.000000 | 4
Fred Anderson | Wendy Yule | 44250.000000 | 4
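With a list of functions, the result gets one top-level column per function. A minimal sketch on the five sample rows from the head of the data, using the string "mean" in place of np.mean (my substitution) together with the built-in len:

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# Each function in the list yields its own block of columns.
table = pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"],
                       aggfunc=["mean", len])
print(table)
```

The columns are addressed as (function, value) pairs, for example `table[("len", "Price")]`.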
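If we want to see sales broken down by product, the columns variable lets us define one or more columns.
I think one of the confusing points with pivot_table is the use of columns and values. Remember that columns are optional: they provide an additional way to segment the actual values you care about. The aggregation functions are applied to the values you list.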
Manager | Rep | sum / Price / CPU | sum / Price / Maintenance | sum / Price / Monitor | sum / Price / Software
Debra Henley | Craig Booker | 65000.0 | 5000.0 | NaN | 10000.0
Debra Henley | Daniel Hilton | 105000.0 | NaN | NaN | 10000.0
Debra Henley | John Smith | 35000.0 | 5000.0 | NaN | NaN
Fred Anderson | Cedric Moss | 95000.0 | 5000.0 | NaN | 10000.0
Fred Anderson | Wendy Yule | 165000.0 | 7000.0 | 5000.0 | NaN
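The role of `columns` can be sketched like this, assuming the five sample rows from the head of the data and the string "sum" as the aggregation. Combinations with no rows in the sample come out as NaN.

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# Product values are spread across the columns; Price is what gets summed.
table = pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"],
                       columns=["Product"], aggfunc="sum")
print(table)
```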
The NaNs are a bit distracting. If we want to remove them, we can use fill_value to set them to 0.
Manager | Rep | sum / Price / CPU | sum / Price / Maintenance | sum / Price / Monitor | sum / Price / Software
Debra Henley | Craig Booker | 65000 | 5000 | 0 | 10000
Debra Henley | Daniel Hilton | 105000 | 0 | 0 | 10000
Debra Henley | John Smith | 35000 | 5000 | 0 | 0
Fred Anderson | Cedric Moss | 95000 | 5000 | 0 | 10000
Fred Anderson | Wendy Yule | 165000 | 7000 | 5000 | 0
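The effect of `fill_value` is easiest to see on a cell that would otherwise be empty. A short sketch on the five sample rows from the head of the data, where Daniel Hilton has no Software row:

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# fill_value replaces the missing combinations with 0 instead of NaN.
table = pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"],
                       columns=["Product"], aggfunc="sum", fill_value=0)
print(table)
```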
Adding Quantity to the values list shows both measures side by side:
Manager | Rep | sum / Price / CPU | sum / Price / Maintenance | sum / Price / Monitor | sum / Price / Software | sum / Quantity / CPU | sum / Quantity / Maintenance | sum / Quantity / Monitor | sum / Quantity / Software
Debra Henley | Craig Booker | 65000 | 5000 | 0 | 10000 | 2 | 2 | 0 | 1
Debra Henley | Daniel Hilton | 105000 | 0 | 0 | 10000 | 4 | 0 | 0 | 1
Debra Henley | John Smith | 35000 | 5000 | 0 | 0 | 1 | 2 | 0 | 0
Fred Anderson | Cedric Moss | 95000 | 5000 | 0 | 10000 | 3 | 1 | 0 | 1
Fred Anderson | Wendy Yule | 165000 | 7000 | 5000 | 0 | 7 | 3 | 2 | 0
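A table with both Price and Quantity broken out by product can be sketched as below, assuming the five sample rows from the head of the data and the string "sum" as the aggregation:

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Quantity": [1, 1, 2, 1, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# Both value columns are summed and split by Product.
table = pd.pivot_table(df, index=["Manager", "Rep"], values=["Price", "Quantity"],
                       columns=["Product"], aggfunc="sum", fill_value=0)
print(table)
```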
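Moving Product from the columns to the index gives a different layout:
Manager | Rep | Product | sum / Price | sum / Quantity
Debra Henley | Craig Booker | CPU | 65000 | 2
Debra Henley | Craig Booker | Maintenance | 5000 | 2
Debra Henley | Craig Booker | Software | 10000 | 1
Debra Henley | Daniel Hilton | CPU | 105000 | 4
Debra Henley | Daniel Hilton | Software | 10000 | 1
Debra Henley | John Smith | CPU | 35000 | 1
Debra Henley | John Smith | Maintenance | 5000 | 2
Fred Anderson | Cedric Moss | CPU | 95000 | 3
Fred Anderson | Cedric Moss | Maintenance | 5000 | 1
Fred Anderson | Cedric Moss | Software | 10000 | 1
Fred Anderson | Wendy Yule | CPU | 165000 | 7
Fred Anderson | Wendy Yule | Maintenance | 7000 | 3
Fred Anderson | Wendy Yule | Monitor | 5000 | 2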
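The long layout above comes from listing Product in the index rather than in the columns. A sketch on the five sample rows from the head of the data, with "sum" as the aggregation:

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Quantity": [1, 1, 2, 1, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# Product is now a third index level, so only combinations that exist get a row.
table = pd.pivot_table(df, index=["Manager", "Rep", "Product"],
                       values=["Price", "Quantity"], aggfunc="sum")
print(table)
```

Because only existing combinations produce rows, there is nothing to fill with 0 in this layout.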
For this data set, this representation makes more sense. Now, what if I want to see some totals? margins=True does that for us.
Manager | Rep | Product | sum / Price | sum / Quantity | mean / Price | mean / Quantity
Debra Henley | Craig Booker | CPU | 65000 | 2 | 32500 | 1.000000
Debra Henley | Craig Booker | Maintenance | 5000 | 2 | 5000 | 2.000000
Debra Henley | Craig Booker | Software | 10000 | 1 | 10000 | 1.000000
Debra Henley | Daniel Hilton | CPU | 105000 | 4 | 52500 | 2.000000
Debra Henley | Daniel Hilton | Software | 10000 | 1 | 10000 | 1.000000
Debra Henley | John Smith | CPU | 35000 | 1 | 35000 | 1.000000
Debra Henley | John Smith | Maintenance | 5000 | 2 | 5000 | 2.000000
Fred Anderson | Cedric Moss | CPU | 95000 | 3 | 47500 | 1.500000
Fred Anderson | Cedric Moss | Maintenance | 5000 | 1 | 5000 | 1.000000
Fred Anderson | Cedric Moss | Software | 10000 | 1 | 10000 | 1.000000
Fred Anderson | Wendy Yule | CPU | 165000 | 7 | 82500 | 3.500000
Fred Anderson | Wendy Yule | Maintenance | 7000 | 3 | 7000 | 3.000000
Fred Anderson | Wendy Yule | Monitor | 5000 | 2 | 5000 | 2.000000
All |  |  | 522000 | 30 | 30705 | 1.764706
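The totals row can be sketched as follows. This version assumes the five sample rows from the head of the data and uses a single "sum" aggregation to keep the output small, so it is a simplification of the sum-and-mean table above.

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Rep": ["Craig Booker"] * 4 + ["Daniel Hilton"],
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Quantity": [1, 1, 2, 1, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
})

# margins=True appends an "All" row holding the aggregate over every row.
table = pd.pivot_table(df, index=["Manager", "Rep", "Product"],
                       values=["Price", "Quantity"], aggfunc="sum", margins=True)
print(table)
```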
Let's take the analysis up a level and look at our pipeline at the manager level. Notice how the statuses are ordered based on our earlier category definition.
Manager | Status | sum / Price
Debra Henley | won | 65000
Debra Henley | pending | 50000
Debra Henley | presented | 50000
Debra Henley | declined | 70000
Fred Anderson | won | 172000
Fred Anderson | pending | 5000
Fred Anderson | presented | 45000
Fred Anderson | declined | 65000
All |  | 522000
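The ordering effect can be reproduced on a small scale. The sketch below assumes the five sample rows from the head of the data and the status order won, pending, presented, declined read off the table above. It leaves out the margins row, and passing `observed=True` is my own choice so that only statuses present in the data appear.

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Price": [30000, 10000, 5000, 35000, 65000],
    "Status": ["presented", "presented", "pending", "declined", "won"],
})

# Assumed category order, inferred from the table above.
order = ["won", "pending", "presented", "declined"]
df["Status"] = df["Status"].astype(pd.CategoricalDtype(categories=order, ordered=True))

# Rows follow the category order rather than alphabetical order.
table = pd.pivot_table(df, index=["Manager", "Status"], values=["Price"],
                       aggfunc="sum", observed=True)
print(table)
```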
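A really handy feature is the ability to pass a dictionary to aggfunc, so you can apply a different function to each of the values you select. This has the side effect of making the labels a little cleaner.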
Manager | Status | Price / CPU | Price / Maintenance | Price / Monitor | Price / Software | Quantity / CPU | Quantity / Maintenance | Quantity / Monitor | Quantity / Software
Debra Henley | won | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0
Debra Henley | pending | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0
Debra Henley | presented | 30000 | 0 | 0 | 20000 | 1 | 0 | 0 | 2
Debra Henley | declined | 70000 | 0 | 0 | 0 | 2 | 0 | 0 | 0
Fred Anderson | won | 165000 | 7000 | 0 | 0 | 2 | 1 | 0 | 0
Fred Anderson | pending | 0 | 5000 | 0 | 0 | 0 | 1 | 0 | 0
Fred Anderson | presented | 30000 | 0 | 5000 | 10000 | 1 | 0 | 1 | 1
Fred Anderson | declined | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0
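A dictionary aggfunc can be sketched like this, assuming the five sample rows from the head of the data. The choice of len for Quantity and "sum" for Price follows the table above, where Quantity holds row counts and Price holds totals.

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Product": ["CPU", "Software", "Maintenance", "CPU", "CPU"],
    "Quantity": [1, 1, 2, 1, 2],
    "Price": [30000, 10000, 5000, 35000, 65000],
    "Status": ["presented", "presented", "pending", "declined", "won"],
})

# Each value column gets its own function: a row count for Quantity, a total for Price.
table = pd.pivot_table(df, index=["Manager", "Status"], columns=["Product"],
                       values=["Quantity", "Price"],
                       aggfunc={"Quantity": len, "Price": "sum"}, fill_value=0)
print(table)
```

Since Status is left as plain strings in this sketch, the rows sort alphabetically.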
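You can also provide a list of aggregation functions for an individual value:
# Quantity is counted with len; Price gets both a sum and a mean.
table = pd.pivot_table(df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"],
                       aggfunc={"Quantity":len,"Price":[np.sum,np.mean]},fill_value=0)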
Manager | Status | Price / mean / CPU | Price / mean / Maintenance | Price / mean / Monitor | Price / mean / Software | Price / sum / CPU | Price / sum / Maintenance | Price / sum / Monitor | Price / sum / Software | Quantity / len / CPU | Quantity / len / Maintenance | Quantity / len / Monitor | Quantity / len / Software
Debra Henley | won | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0
Debra Henley | pending | 40000 | 5000 | 0 | 0 | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0
Debra Henley | presented | 30000 | 0 | 0 | 10000 | 30000 | 0 | 0 | 20000 | 1 | 0 | 0 | 2
Debra Henley | declined | 35000 | 0 | 0 | 0 | 70000 | 0 | 0 | 0 | 2 | 0 | 0 | 0
Fred Anderson | won | 82500 | 7000 | 0 | 0 | 165000 | 7000 | 0 | 0 | 2 | 1 | 0 | 0
Fred Anderson | pending | 0 | 5000 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 1 | 0 | 0
Fred Anderson | presented | 30000 | 0 | 5000 | 10000 | 30000 | 0 | 5000 | 10000 | 1 | 0 | 1 | 1
Fred Anderson | declined | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0
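It can look daunting to pull all of this together at once, but as soon as you start playing with the data and slowly adding items, you get a feel for how it works. My general rule of thumb is that once you are using multiple groupby operations, you should evaluate whether a pivot table is a useful approach.
Once you have generated your data, it is in a DataFrame, so you can filter it using the standard DataFrame functions.
If you want to look at just one manager: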
table.query('Manager == ["Debra Henley"]')
Manager | Status | Price / mean / CPU | Price / mean / Maintenance | Price / mean / Monitor | Price / mean / Software | Price / sum / CPU | Price / sum / Maintenance | Price / sum / Monitor | Price / sum / Software | Quantity / len / CPU | Quantity / len / Maintenance | Quantity / len / Monitor | Quantity / len / Software
Debra Henley | won | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0
Debra Henley | pending | 40000 | 5000 | 0 | 0 | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0
Debra Henley | presented | 30000 | 0 | 0 | 10000 | 30000 | 0 | 0 | 20000 | 1 | 0 | 0 | 2
Debra Henley | declined | 35000 | 0 | 0 | 0 | 70000 | 0 | 0 | 0 | 2 | 0 | 0 | 0
To look at all of the pending and won deals:
table.query('Status == ["pending","won"]')
Manager | Status | Price / mean / CPU | Price / mean / Maintenance | Price / mean / Monitor | Price / mean / Software | Price / sum / CPU | Price / sum / Maintenance | Price / sum / Monitor | Price / sum / Software | Quantity / len / CPU | Quantity / len / Maintenance | Quantity / len / Monitor | Quantity / len / Software
Debra Henley | won | 65000 | 0 | 0 | 0 | 65000 | 0 | 0 | 0 | 1 | 0 | 0 | 0
Debra Henley | pending | 40000 | 5000 | 0 | 0 | 40000 | 10000 | 0 | 0 | 1 | 2 | 0 | 0
Fred Anderson | won | 82500 | 7000 | 0 | 0 | 165000 | 7000 | 0 | 0 | 2 | 1 | 0 | 0
Fred Anderson | pending | 0 | 5000 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 1 | 0 | 0
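The filtering step works because query can refer to index levels by name. Here is a self-contained sketch on the five sample rows from the head of the data, using a simpler table with one summed Price column than the one above:

```python
import pandas as pd

# Five sample rows copied from the head of the sales data shown earlier.
df = pd.DataFrame({
    "Manager": ["Debra Henley"] * 5,
    "Price": [30000, 10000, 5000, 35000, 65000],
    "Status": ["presented", "presented", "pending", "declined", "won"],
})

table = pd.pivot_table(df, index=["Manager", "Status"], values=["Price"], aggfunc="sum")

# "Status" here names a level of the row index, not a column.
subset = table.query('Status == ["pending", "won"]')
print(subset)
```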
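This is a powerful feature of pivot_table, so once your data is in the pivot_table format you need, don't forget that the full power of pandas is still available to you.
To sum all of this up, I have created a cheat sheet that I hope will help you remember how to use the pandas pivot_table.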