A Gentle Introduction to Python for Tableau Developers (Part 3)

Exploring Data with Python

One of Tableau’s biggest advantages is how it lets you swim around in your data. You don’t always need a fine-tuned dashboard to find meaningful insights, so even someone with quite a basic understanding of Tableau can make a significant impact.

For this article, we’ll play into that theme of not needing to know everything about a tool in order to build useful things with it. In our previous article, we touched on how you can create custom calculations and color visuals in Python to arrive at visuals that look quite similar to what we build in Tableau.

Today, let’s expand on what we’ve learned so far. Let’s see how we can take what we’ve seen up to this point and apply that to a common scenario: data exploration.

Setting the stage

Previously, we took our ‘Sales’ and ‘Profit’ columns and created a new column named ‘Profit Ratio’. We then modified our plot from the first article, showing sales per product sub-category, and added color to the visual using our new ‘Profit Ratio’ metric:

Whenever I teach a newcomer to Tableau how to use the software, I always plug in this visual somewhere in the mix.

Something I view as a common mistake in data visualization is that people often think that just because they can use color, they always should. If we throw color randomly at our visuals, the results are typically no better than a simple table. Some visuals that get lost in the sauce can be downright confusing.

In our use case, color is appropriate because it helps to quickly focus in on the desired “a-ha” revelation: higher volumes of sales do not necessarily lead to higher volumes of profits.

If this data were representative of a real business, a natural question that follows this visual might be: what’s happening that’s making some of our high-sale items less profitable (let’s assume this is bad) and some of our low-sale items highly profitable (let’s assume this is good)?

To provide any meaningful answers to that question, we’re going to need to do a little data exploration.

Step 1: deciding to look at our most impactful customers

First of all, context matters. The context of this data is that we are a retail store selling products. Customers can only buy what we sell, and they are buying at prices we have set. Therefore, if we have any negative profits this is a problem of our own making. Perhaps we are losing money on some products intentionally, lowering prices to attract customers whose other purchases make up for the initial loss.

My opinion is that regardless of what we discover, we want to land on something actionable. We need to be able to do something with the insights we generate, because otherwise what’s the point?

So rather than analyze ALL of our customers, let’s focus in on our top customers. If you have a limited budget for outreach and customer service, it often makes sense to focus those resources on the top customers. For this exercise, let’s define our ‘top customers’ as those who spend the most money per order. Let’s keep it simple and say we’re interested in getting an overview of how the sales and profitability looks for our top 10% of customers.

For this Superstore data, that means we are looking at the top 159 customers.

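
That 10% cutoff is just the distinct customer count multiplied by 0.10 and truncated to a whole number. Here's a minimal sketch of that arithmetic on a hypothetical mini dataset (30 made-up customer IDs; your copy of the Superstore data will have a different count):

```python
import pandas as pd

# hypothetical mini version of store_df with 30 distinct customers
store_df = pd.DataFrame({'Customer ID': [f'C{i:03d}' for i in range(30)]})

n_customers = store_df['Customer ID'].nunique()
top_n = int(n_customers * 0.10)  # the top 10% cutoff; 3 for this mini dataset
```

Swap in the real dataframe and the same two lines give you the cutoff for your data.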

Step 2: getting the sales per order for our customers

In Tableau, we would reach for a calculated field. In Python, we will want to first create a dataframe at the appropriate level of aggregation and then add a new column storing the average sales per order.

We want to know the sales per order for each customer, so the appropriate level of aggregation here is at the customer level (Customer ID).

To calculate the sales per order for each customer, we will need to calculate the total sales and count the number of orders for each customer.

We can piece this together using a Pandas dataframe (note that ‘store_df’ is the dataframe storing all of the Superstore data):

orders_per_customer_df = store_df\
    .groupby('Customer ID')\
    .agg({
        'Order ID': pd.Series.nunique,
        'Sales': 'sum'
    })\
    .reset_index()\
    .rename(columns={'Order ID': 'order_count'})\
    .sort_values('Sales', ascending=False)

So let’s read through this line by line like it’s a book:

  1. our output will be stored in a variable named ‘orders_per_customer_df’

  2. we are grouping our store data by the ‘Customer ID’ column

  3. we are aggregating two columns: a unique count of ‘Order ID’ (represented by the pd.Series.nunique function) and the sum of ‘Sales’

  4. we are resetting the index of the resulting dataframe; if you have no idea what this means then try running the code without that statement and see how the results differ!

  5. we are renaming the ‘Order ID’ column to be ‘order_count’, which is a more appropriate name given we are no longer looking at the actual ID values

  6. we are sorting the resulting dataframe by Sales, with highest values at the top

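
The steps above can be traced end to end on a toy stand-in for store_df (the IDs and sales figures below are made up purely for illustration; note how the repeated ‘O1’ counts only once thanks to nunique):

```python
import pandas as pd

# toy stand-in for store_df: C1 has two line items on the same order (O1)
store_df = pd.DataFrame({
    'Customer ID': ['C1', 'C1', 'C1', 'C2'],
    'Order ID':    ['O1', 'O1', 'O2', 'O3'],
    'Sales':       [100.0, 50.0, 25.0, 80.0],
})

orders_per_customer_df = store_df\
    .groupby('Customer ID')\
    .agg({'Order ID': pd.Series.nunique, 'Sales': 'sum'})\
    .reset_index()\
    .rename(columns={'Order ID': 'order_count'})\
    .sort_values('Sales', ascending=False)

# C1 ends up with order_count 2 (not 3) and Sales 175.0, sorted above C2
```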

Step 3: calculating the total sales per order for each customer

First, let’s admire a snippet of the dataframe we created in the previous step:

Alright, so we have a dataframe with our customers, the number of orders those customers made, and the total amount spent.

To calculate the total sales per order, all we need to do is divide the ‘Sales’ column by the ‘order_count’ column.

orders_per_customer_df['avg_sales_per_order'] = \
    orders_per_customer_df['Sales'] / orders_per_customer_df['order_count']

In the code snippet above, we have defined a new column ‘avg_sales_per_order’ and set that to be the result of our ‘Sales’ column divided by our ‘order_count’ column.

Here’s what that looks like, now sorted in descending order by ‘avg_sales_per_order’.

You can use the .head() function on a dataframe to get a preview like this!
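
A tiny sketch of .head() on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Sales': [500, 300, 200, 100]})

preview = df.head(2)  # the first 2 rows, in the dataframe's current order
```

Because our dataframe is already sorted in descending order, .head() shows us the top values first.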

Step 4: taking a look at the order count distribution

Let’s do a quick side-quest. Comparing the order counts from the beginning of step 3 and the end of step 3, it seems there might be a lot of variation in the order counts of our customers.

In an effort to better understand customer behavior, let’s visualize the distribution of order counts.

sns.distplot(orders_per_customer_df['order_count'])

Running the line of code above gives us this:

Distribution of order counts for all customers

Ah, it looks like customers are naturally grouped into two camps. One group averages about 5 orders and the other averages about 25 orders.

This could be a reflection of our product mix; some customers may purchase a small amount of expensive items, while other customers shop more frequently for less expensive products.


Step 5: filter our data to only consider the top 10% of customers

In Tableau, one way to do this would be to create a filter for the ‘Customer ID’ field based on the calculated field for ‘avg_sales_per_order’ and only include the top 159 results.

In Python, one potential solution is this:

top_customers_df = store_df[
    store_df['Customer ID']
    .isin(orders_per_customer_df.head(159)['Customer ID'])
]

Let’s dissect this.

First of all, we are storing the results in a variable named ‘top_customers_df’.

Second, we are using Pandas dataframe notation to essentially say “give us all rows of the ‘store_df’ dataframe that satisfy this condition.” In our case, the condition needing to be satisfied is that any given ‘Customer ID’ encountered must also be in the ‘Customer ID’ column for the top 159 rows of our ‘orders_per_customer_df’ dataframe.

In other words, we are filtering our store data such that we are only seeing data for the customer ID values seen in the top 159 rows of the dataframe that holds our top customers.

Go back and check out how we defined that ‘orders_per_customer_df’ dataframe if this is not clicking. Keep in mind that the .head(x) function returns the top x number of rows for any dataframe calling it.

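
If the sort-then-head chain feels indirect, nlargest gets the same top-N customer IDs in one step, without relying on the dataframe being pre-sorted. A sketch with toy stand-ins for our two dataframes (all names and values below are illustrative):

```python
import pandas as pd

# toy stand-ins for the article's dataframes
orders_per_customer_df = pd.DataFrame({
    'Customer ID': ['C1', 'C2', 'C3', 'C4'],
    'avg_sales_per_order': [400.0, 150.0, 900.0, 250.0],
})
store_df = pd.DataFrame({
    'Customer ID': ['C1', 'C2', 'C3', 'C3', 'C4'],
    'Sales': [400.0, 150.0, 500.0, 400.0, 250.0],
})

# top 2 customers by average sales per order (you'd use 159 on the real data)
top_ids = orders_per_customer_df.nlargest(2, 'avg_sales_per_order')['Customer ID']
top_customers_df = store_df[store_df['Customer ID'].isin(top_ids)]
```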

Step 6: preview the data for our top customers

Now let’s take a quick look at the results, having filtered our store data to only include the top 159 customers. Let’s group this by ‘Sub-Category’ and aggregate the data to see total sales, average discount percentage, total profit, and the total number of orders.

subcat_top_customers_df = top_customers_df\
    .groupby('Sub-Category')\
    .agg({
        'Sales': 'sum',
        'Discount': 'mean',
        'Profit': 'sum',
        'Order ID': pd.Series.nunique
    })\
    .rename(columns={
        'Discount': 'Avg Discount',
        'Order ID': 'Num Orders'
    })\
    .sort_values('Profit', ascending=False)\
    .reset_index()

Are you getting a feel for how these Pandas dataframes work? I recommend plugging the code in yourself and playing around with it. What happens if you change ‘sum’ to ‘median’, ‘min’, or ‘max’?

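
As a starting point for that experiment, you can even hand .agg a list of functions per column and compare them side by side (toy data, purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Sub-Category': ['Tables', 'Tables', 'Chairs'],
    'Sales': [100.0, 300.0, 50.0],
})

# several aggregations at once; the result gets a two-level column index
summary = df.groupby('Sub-Category').agg({'Sales': ['sum', 'median', 'min', 'max']})
```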

Step 7: visualize the results for the top 10% of customers

Before looking at the code to create this visual, let’s take a moment to soak it in. What are we looking at?

On the left, we see total profits by sub-category. All of the bars where profits are above zero are colored blue. There is only one unprofitable sub-category here, and it’s ‘Tables’.

Remember ‘Tables’ from the previous article? It was the sub-category screaming for attention. What’s interesting here is that Tables isn’t just unprofitable, it’s unprofitable even when looking at our top 10% of customers. Why is it unprofitable?

That’s where the visual on the right-hand side comes in, highlighting the average discount. Once again, ‘Tables’ is screaming for attention, and here we can see that some abnormally high discounts are being given on our tables.

There may be a good business reason to heavily discount tables (perhaps it’s a loss leader), but at least now we know why tables are unprofitable: they are being discounted at >25%, which is much higher than any other product sub-category.

Step 8: understand how the visuals came together

Here’s how the visual above works:

fig, axs = plt.subplots(1, 2, figsize=(16, 8), sharey=True)

sns.barplot(data=subcat_top_customers_df,
            x='Profit', y='Sub-Category', ax=axs[0],
            palette=cm.RdBu(subcat_top_customers_df['Profit']), ci=False)

sns.barplot(data=subcat_top_customers_df,
            x='Avg Discount', y='Sub-Category', ax=axs[1],
            palette=cm.RdBu(subcat_top_customers_df['Avg Discount'] * 5.5), ci=False)

axs[0].tick_params(axis='both', which='both', length=0)
axs[1].tick_params(axis='both', which='both', length=0)
axs[1].set_ylabel('')

sns.despine(left=True, bottom=True)

The first line is something you’ll often see when working with Python plotting libraries. We are establishing a figure and a set of axes. The figure is like the canvas on which the visuals will live, and the axes act as the spine of our numerical and categorical data. In the first line we define a figure with 2 pieces of real estate: 1 row with two plots side by side. The ‘sharey’ parameter says that both visuals will share a y-axis, so there is no need to list out the sub-categories twice.

We then call on the Seaborn library, which was imported earlier under the alias ‘sns’, to plot our bar graphs. Here we define the dataframe providing our data, the columns which will provide our x-axis and y-axis, and the color gradient for each visual. The ‘ci’ parameter is set to ‘False’ to remove extra lines that would otherwise appear to show us confidence intervals. Go ahead and flip that to ‘True’ and see how the visual changes.

The final lines are cosmetic formatting, getting rid of tick marks and such. I highly recommend tinkering with this to get comfortable with the concept of formatting through code rather than through a click-and-drag interface like Tableau. Something nice about defining formatting through code is that you can build reusable functions that always apply your favorite formatting tricks.

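
For example, the cosmetic lines above could be wrapped into a small helper you reuse across plots (a sketch; the function name is mine, not from the original code):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

def clean_axes(ax, hide_ylabel=False):
    """Apply the formatting tricks from above: no tick marks, no spines."""
    ax.tick_params(axis='both', which='both', length=0)
    if hide_ylabel:
        ax.set_ylabel('')
    sns.despine(ax=ax, left=True, bottom=True)

# usage: format any axes with one call
fig, ax = plt.subplots()
ax.set_ylabel('Sub-Category')
clean_axes(ax, hide_ylabel=True)
```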

Wrapping it up

So, here we are. Our attention-seeking tables got the attention they wanted all along, and we got a bit more exposure to shaping and controlling our data using Pandas dataframes.

The data exploration here wasn’t earth-shattering, but if you’re comfortable with the Python code we’ve written up until now, then you’re ready to dive into our next session!

Tune in next time, when we’ll explore joining data from different tables. Pandas makes this quite easy for us, so it’ll be a breeze for us to join our ‘Orders’ data to our ‘Returns’ data and answer the question: how many of our products have been returned? What percentage of our revenue is disappearing due to returns?

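
As a teaser, the join itself really is a one-liner in Pandas. A rough sketch with made-up mini tables (the order IDs, sales figures, and the 'Returned'/'Yes' convention below are assumptions for illustration):

```python
import pandas as pd

# made-up mini versions of the 'Orders' and 'Returns' tables
orders = pd.DataFrame({'Order ID': ['A', 'B', 'C'], 'Sales': [100.0, 250.0, 50.0]})
returns = pd.DataFrame({'Order ID': ['B'], 'Returned': ['Yes']})

# left join keeps every order; orders with no match get NaN, filled in as 'No'
merged = orders.merge(returns, on='Order ID', how='left')
merged['Returned'] = merged['Returned'].fillna('No')

pct_revenue_returned = (
    merged.loc[merged['Returned'] == 'Yes', 'Sales'].sum() / merged['Sales'].sum() * 100
)
```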

Hope to see you there!

Translated from: https://towardsdatascience.com/a-gentle-introduction-to-python-for-tableau-developers-part-3-8634fa5b9dec
