Explore Happiness Data Using Python Pivot Tables

One of the biggest challenges when facing a new data set is knowing where to start and what to focus on. Being able to quickly summarize hundreds of rows and columns can save you a lot of time and frustration. A simple tool you can use to achieve this is a pivot table, which helps you slice, filter, and group data at the speed of inquiry and represent the information in a visually appealing way.

Pivot table, what is it good for?

You may already be familiar with the concept of pivot tables from Excel, where they were introduced in 1994 under the trademarked name PivotTable. This tool enables users to automatically sort, count, total, or average the data stored in one table. In the image below we used the PivotTable functionality to quickly summarize the Titanic data set. The larger table displays the first ~30 rows of the data set, and the smaller tables are the PivotTables we created.

[Figure: Excel PivotTables summarizing the Titanic data set]

The pivot table on the left groups the data by the Sex and Survived columns. As a result, it displays the percentage of each gender within each survival status (0: Didn’t survive, 1: Survived). This lets us quickly see that women had better chances of survival than men. The table on the right also uses the Survived column, but this time the data is grouped by Class.

Introducing our data set: World Happiness Report

We used Excel for the above examples, but this post will demonstrate the advantages of pandas’ built-in pivot_table function. We’ll use the World Happiness Report, which is a survey about the state of global happiness. The report ranks more than 150 countries by their happiness levels, and has been published almost every year since 2012. We’ll use data collected in the years 2015, 2016, and 2017, which is available for download if you’d like to follow along. We’re running Python 3.6 and pandas 0.19.

Some interesting questions we might like to answer are:

  • Which are the happiest and least happy countries and regions in the world?
  • Is happiness affected by region?
  • Did the happiness score change significantly over the past three years?

Let’s import our data and take a quick first look:

import pandas as pd
import numpy as np

# reading the data
data = pd.read_csv('data.csv', index_col=0)

# sort the df by ascending years and descending happiness scores
data.sort_values(['Year', "Happiness Score"], ascending=[True, False], inplace=True)

# display first 10 rows
data.head(10)

| | Country | Region | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 141 | Switzerland | Western Europe | 1.0 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 | 2015 |
| 60 | Iceland | Western Europe | 2.0 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 | 2015 |
| 38 | Denmark | Western Europe | 3.0 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 | 2015 |
| 108 | Norway | Western Europe | 4.0 | 7.522 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 | 2015 |
| 25 | Canada | North America | 5.0 | 7.427 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 | 2015 |
| 46 | Finland | Western Europe | 6.0 | 7.406 | 1.29025 | 1.31826 | 0.88911 | 0.64169 | 0.41372 | 0.23351 | 2.61955 | 2015 |
| 102 | Netherlands | Western Europe | 7.0 | 7.378 | 1.32944 | 1.28017 | 0.89284 | 0.61576 | 0.31814 | 0.47610 | 2.46570 | 2015 |
| 140 | Sweden | Western Europe | 8.0 | 7.364 | 1.33171 | 1.28907 | 0.91087 | 0.65980 | 0.43844 | 0.36262 | 2.37119 | 2015 |
| 103 | New Zealand | Australia and New Zealand | 9.0 | 7.286 | 1.25018 | 1.31967 | 0.90837 | 0.63938 | 0.42922 | 0.47501 | 2.26425 | 2015 |
| 6 | Australia | Australia and New Zealand | 10.0 | 7.284 | 1.33358 | 1.30923 | 0.93156 | 0.65124 | 0.35637 | 0.43562 | 2.26646 | 2015 |

Each country’s Happiness Score is calculated by summing the seven other variables in the table. Each of these variables reveals a population-weighted average score on a scale running from 0 to 10, that is tracked over time and compared against other countries.

These variables are:

  • Economy: real GDP per capita
  • Family: social support
  • Health: healthy life expectancy
  • Freedom: freedom to make life choices
  • Trust: perceptions of corruption
  • Generosity: perceptions of generosity
  • Dystopia: each country is compared against a hypothetical nation that represents the lowest national averages for each key variable and is, along with residual error, used as a regression benchmark
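As a quick sanity check of the claim that the score is the sum of these seven variables, the sketch below adds up the components of a single row. The numbers are copied from the Switzerland 2015 row shown earlier; the variable names `row` and `reconstructed_score` are only illustrative.

```python
import pandas as pd

# Component values copied from the Switzerland 2015 row of the table above.
row = pd.Series({
    "Economy (GDP per Capita)": 1.39651,
    "Family": 1.34951,
    "Health (Life Expectancy)": 0.94143,
    "Freedom": 0.66557,
    "Trust (Government Corruption)": 0.41978,
    "Generosity": 0.29678,
    "Dystopia Residual": 2.51738,
})

# Summing the seven components and rounding to three decimals
# should reproduce the published Happiness Score of 7.587.
reconstructed_score = round(row.sum(), 3)
print(reconstructed_score)
```

On the full data set the same idea would be a row-wise sum over these seven columns, compared against the Happiness Score column.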

Each country’s Happiness Score determines its Happiness Rank – which is its relative position among other countries in that specific year. For example, the first row indicates that Switzerland was ranked the happiest country in 2015 with a happiness score of 7.587. Switzerland was ranked first just before Iceland, which scored 7.561. Denmark was ranked third in 2015, and so on. It’s interesting to note that Western Europe took seven of the top eight rankings in 2015.

We’ll concentrate on the final Happiness Score to demonstrate the technical aspects of pivot_table.


Our data has 495 rows and 12 columns
Are there missing values? True

| | Happiness Rank | Happiness Score | Economy (GDP per Capita) | Family | Health (Life Expectancy) | Freedom | Trust (Government Corruption) | Generosity | Dystopia Residual | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 470.000000 | 470.000000 | 470.000000 | 470.000000 | 470.000000 | 470.000000 | 470.000000 | 470.000000 | 470.000000 | 495.000000 |
| mean | 78.829787 | 5.370728 | 0.927830 | 0.990347 | 0.579968 | 0.402828 | 0.134790 | 0.242241 | 2.092717 | 2016.000000 |
| std | 45.281408 | 1.136998 | 0.415584 | 0.318707 | 0.240161 | 0.150356 | 0.111313 | 0.131543 | 0.565772 | 0.817323 |
| min | 1.000000 | 2.693000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.328580 | 2015.000000 |
| 25% | 40.000000 | 4.509000 | 0.605292 | 0.793000 | 0.402301 | 0.297615 | 0.059777 | 0.152831 | 1.737975 | 2015.000000 |
| 50% | 79.000000 | 5.282500 | 0.995439 | 1.025665 | 0.630053 | 0.418347 | 0.099502 | 0.223140 | 2.094640 | 2016.000000 |
| 75% | 118.000000 | 6.233750 | 1.252443 | 1.228745 | 0.768298 | 0.516850 | 0.173161 | 0.315824 | 2.455575 | 2017.000000 |
| max | 158.000000 | 7.587000 | 1.870766 | 1.610574 | 1.025250 | 0.669730 | 0.551910 | 0.838075 | 3.837720 | 2017.000000 |

The describe() method reveals that Happiness Rank ranges from 1 to 158, which means that the largest number of surveyed countries for a given year was 158. It’s worth noting that Happiness Rank was originally of type int. The fact it’s displayed as a float here implies we have NaN values in this column (we can also determine this by the count row which only amounts to 470 as opposed to the 495 rows in our data set).

The Year column doesn’t have any missing values. We can tell in two ways: it’s displayed in the data set as int, and the count for Year is 495, which is the number of rows in our data set. Comparing the count for Year with the other numeric columns, we can expect 25 missing values in each of them (495 in Year vs. 470 in all other columns).
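The code behind the summary lines and the describe() table is not shown in the post. A minimal sketch of how such checks might look, on a made-up four-row frame (`toy` and its values are hypothetical, not the real data), is:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data set: country "C" has no survey results.
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Happiness Rank": [1, 2, np.nan, 3],         # NaN forces this column to float
    "Happiness Score": [7.5, 6.1, np.nan, 4.2],
    "Year": [2015, 2015, 2015, 2015],            # no NaN, so it stays int
})

n_rows, n_cols = toy.shape
print("Our data has {} rows and {} columns".format(n_rows, n_cols))
print("Are there missing values? {}".format(toy.isnull().any().any()))

# describe() counts only non-missing entries, so the gap between the
# counts of two columns is the number of missing values.
counts = toy.describe().loc["count"]
missing_per_column = toy.isnull().sum()
print(counts)
print(missing_per_column)
```

Here the rank column ends up as float because of the NaN, while Year keeps its integer type, which mirrors the reasoning in the two paragraphs above.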

Categorizing the data by Year and Region

The fun thing about pandas pivot_table is that you can get another point of view on your data with only one line of code. Most of the pivot_table parameters have default values, so the only parameters you must supply are data and index. Though it isn’t mandatory, we’ll also use the values parameter in the next example.

  • data is self explanatory – it’s the DataFrame you’d like to use
  • index is the column, grouper, array (or list of the previous) you’d like to group your data by. It will be displayed in the index column (or columns, if you’re passing in a list)
  • values (optional) is the column you’d like to aggregate. If you do not specify this then the function will aggregate all numeric columns.

Let’s first look at the output, and then explain how the table was produced:

pd.pivot_table(data, index='Year', values="Happiness Score")

| Year | Happiness Score |
|---|---|
| 2015 | 5.375734 |
| 2016 | 5.382185 |
| 2017 | 5.354019 |

By passing Year as the index parameter, we chose to group our data by Year. The output is a pivot table that displays the three different values for Year as index, and the Happiness Score as values. It’s worth noting that the aggregation default value is mean (or average), so the values displayed in the Happiness Score column are the yearly average for all countries. The table shows that the average for all countries was highest in 2016 and lowest in 2017.
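To make the grouping and the default mean explicit, here is a small sketch on a made-up frame (`toy` and its values are hypothetical). It also computes the same numbers with groupby(), purely as a cross-check.

```python
import pandas as pd

# Hypothetical data: two countries surveyed in two years.
toy = pd.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Year": [2015, 2015, 2016, 2016],
    "Happiness Score": [7.0, 5.0, 6.0, 5.0],
})

# With no aggfunc given, pivot_table averages "Happiness Score" within each Year.
by_year = pd.pivot_table(toy, index="Year", values="Happiness Score")

# The same numbers via groupby, to show the grouping step on its own.
by_year_groupby = toy.groupby("Year")["Happiness Score"].mean()
print(by_year)
```

The 2015 row is the average of 7.0 and 5.0, and the 2016 row is the average of 6.0 and 5.0.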

Here’s a detailed diagram of how this pivot table was created:

[Figure: diagram of how the pivot table was created]

Next, let’s use the Region column as index:

| Region | Happiness Score |
|---|---|
| Australia and New Zealand | 7.302500 |
| Central and Eastern Europe | 5.371184 |
| Eastern Asia | 5.632333 |
| Latin America and Caribbean | 6.069074 |
| Middle East and Northern Africa | 5.387879 |
| North America | 7.227167 |
| Southeastern Asia | 5.364077 |
| Southern Asia | 4.590857 |
| Sub-Saharan Africa | 4.150957 |
| Western Europe | 6.693000 |

The numbers displayed in the Happiness Score column in the pivot table above are the mean, just as before – but this time it’s each region’s mean for all years documented (2015, 2016, 2017). This display makes it easier to see Australia and New Zealand have the highest average score, while North America is ranked close behind. It’s interesting that despite the initial impression we got from reading the data, which showed Western Europe in most of the top places, Western Europe is actually ranked third when calculating the average for the past three years. The lowest ranked region is Sub-Saharan Africa, and close behind is Southern Asia.

Creating a multi-index pivot table

You may have used groupby() to achieve some of the pivot table functionality (we’ve previously demonstrated how to use groupby() to analyze your data). However, the pivot_table() built-in function offers straightforward parameter names and default values that can help simplify complex procedures like multi-indexing.

In order to group the data by more than one column, all we have to do is pass in a list of column names. Let’s categorize the data by Region and Year.

pd.pivot_table(data, index=['Region', 'Year'], values="Happiness Score")

| Region | Year | Happiness Score |
|---|---|---|
| Australia and New Zealand | 2015 | 7.285000 |
| | 2016 | 7.323500 |
| | 2017 | 7.299000 |
| Central and Eastern Europe | 2015 | 5.332931 |
| | 2016 | 5.370690 |
| | 2017 | 5.409931 |
| Eastern Asia | 2015 | 5.626167 |
| | 2016 | 5.624167 |
| | 2017 | 5.646667 |
| Latin America and Caribbean | 2015 | 6.144682 |
| | 2016 | 6.101750 |
| | 2017 | 5.957818 |
| Middle East and Northern Africa | 2015 | 5.406900 |
| | 2016 | 5.386053 |
| | 2017 | 5.369684 |
| North America | 2015 | 7.273000 |
| | 2016 | 7.254000 |
| | 2017 | 7.154500 |
| Southeastern Asia | 2015 | 5.317444 |
| | 2016 | 5.338889 |
| | 2017 | 5.444875 |
| Southern Asia | 2015 | 4.580857 |
| | 2016 | 4.563286 |
| | 2017 | 4.628429 |
| Sub-Saharan Africa | 2015 | 4.202800 |
| | 2016 | 4.136421 |
| | 2017 | 4.111949 |
| Western Europe | 2015 | 6.689619 |
| | 2016 | 6.685667 |
| | 2017 | 6.703714 |

These examples also reveal where the pivot table got its name: it allows you to rotate, or pivot, the summary table, and this rotation gives us a different perspective on the data, one that can help you quickly gain valuable insights.

This is one way to look at the data, but we can use the columns parameter to get a better display:

  • columns is the column, grouper, array, or list of the previous you’d like to group your data by. Using it will spread the different values horizontally.
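A small sketch of what "spreading the values horizontally" means, on a made-up frame (`toy` and its values are hypothetical). The long layout keeps Region and Year in the row index, while the wide layout moves Year into the column headers; unstacking the long table is shown only as a cross-check.

```python
import pandas as pd

# Hypothetical data: two regions, two years, one score each.
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Year": [2015, 2016, 2015, 2016],
    "Happiness Score": [7.0, 7.2, 4.0, 4.4],
})

# Long layout: Region and Year both sit in the row index.
long_table = pd.pivot_table(toy, index=["Region", "Year"], values="Happiness Score")

# Wide layout: the distinct Year values become column headers.
wide_table = pd.pivot_table(toy, index="Region", columns="Year", values="Happiness Score")

# Unstacking the Year level of the long table gives the same numbers.
unstacked = long_table["Happiness Score"].unstack("Year")
print(wide_table)
```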
Using Year as the columns argument spreads the different values of Year across the top of the table, which makes for a much better display, like so:

| Region \ Year | 2015 | 2016 | 2017 |
|---|---|---|---|
| Australia and New Zealand | 7.285000 | 7.323500 | 7.299000 |
| Central and Eastern Europe | 5.332931 | 5.370690 | 5.409931 |
| Eastern Asia | 5.626167 | 5.624167 | 5.646667 |
| Latin America and Caribbean | 6.144682 | 6.101750 | 5.957818 |
| Middle East and Northern Africa | 5.406900 | 5.386053 | 5.369684 |
| North America | 7.273000 | 7.254000 | 7.154500 |
| Southeastern Asia | 5.317444 | 5.338889 | 5.444875 |
| Southern Asia | 4.580857 | 4.563286 | 4.628429 |
| Sub-Saharan Africa | 4.202800 | 4.136421 | 4.111949 |
| Western Europe | 6.689619 | 6.685667 | 6.703714 |

Visualizing the pivot table using plot()

If you want to look at the visual representation of the previous pivot table we created, all you need to do is add plot() at the end of the pivot_table function call (you’ll also need to import the relevant plotting libraries).

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# use Seaborn styles
sns.set()

pd.pivot_table(data, index='Region', columns='Year', values="Happiness Score").plot(kind='bar')
# the bars show the mean Happiness Score, so label the axis accordingly
plt.ylabel("Happiness Score")


The visual representation helps reveal that the differences are minor. Having said that, it also shows a steady decrease in the happiness score of both regions located in the Americas (North America, and Latin America and Caribbean).

Manipulating the data using aggfunc

Up until now we’ve used the average to get insights about the data, but there are other important values to consider. Time to experiment with the aggfunc parameter:

  • aggfunc (optional) accepts a function or list of functions you’d like to use on your group (default: numpy.mean). If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names.
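A minimal sketch of passing a list of functions, on a made-up frame (`toy` and its values are hypothetical). The functions are given as strings here, which pandas resolves to its own implementations; this is a choice made for the sketch, not necessarily how the tables in this post were produced.

```python
import pandas as pd

# Hypothetical data: three scores in one region, two in another.
toy = pd.DataFrame({
    "Region": ["North", "North", "North", "South", "South"],
    "Happiness Score": [7.0, 7.2, 7.4, 4.0, 5.0],
})

# A list of aggregations yields hierarchical columns:
# top level = function name, second level = the aggregated column.
stats = pd.pivot_table(
    toy,
    index="Region",
    values="Happiness Score",
    aggfunc=["mean", "median", "min", "max", "std"],
)
print(stats)
```

A single cell is then addressed with a tuple, for example `stats.loc["South", ("max", "Happiness Score")]`.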
Let’s add the median, minimum, maximum, and the standard deviation for each region. This can help us evaluate how accurate the average is, and if it’s really representative of the real picture.

| | mean | median | min | max | std |
|---|---|---|---|---|---|
| Region | Happiness Score | Happiness Score | Happiness Score | Happiness Score | Happiness Score |
| Australia and New Zealand | 7.302500 | 7.2995 | 7.284 | 7.334 | 0.020936 |
| Central and Eastern Europe | 5.371184 | 5.4010 | 4.096 | 6.609 | 0.578274 |
| Eastern Asia | 5.632333 | 5.6545 | 4.874 | 6.422 | 0.502100 |
| Latin America and Caribbean | 6.069074 | 6.1265 | 3.603 | 7.226 | 0.728157 |
| Middle East and Northern Africa | 5.387879 | 5.3175 | 3.006 | 7.278 | 1.031656 |
| North America | 7.227167 | 7.2175 | 6.993 | 7.427 | 0.179331 |
| Southeastern Asia | 5.364077 | 5.2965 | 3.819 | 6.798 | 0.882637 |
| Southern Asia | 4.590857 | 4.6080 | 3.360 | 5.269 | 0.535978 |
| Sub-Saharan Africa | 4.150957 | 4.1390 | 2.693 | 5.648 | 0.584945 |
| Western Europe | 6.693000 | 6.9070 | 4.857 | 7.587 | 0.777886 |

It looks like some regions have extreme values that might affect our average more than we’d like. For example, the Middle East and Northern Africa region has a high standard deviation, so we might want to remove extreme values. Let’s see how many values we’re calculating for each region, since this might affect the picture we’re seeing. For example, Australia and New Zealand has a very low standard deviation and is ranked happiest in all three years, but we can also assume it only accounts for two countries.

Applying a custom function to remove outliers

pivot_table allows you to pass your own custom aggregation functions as arguments. You can either use a lambda function, or create a function. Let’s calculate the average number of countries in each region in a given year. We can do this easily using a lambda function, like so:

pd.pivot_table(data, index='Region', values="Happiness Score",
               aggfunc=[np.mean, min, max, np.std, lambda x: x.count() / 3])

| | mean | min | max | std | <lambda> |
|---|---|---|---|---|---|
| Region | Happiness Score | Happiness Score | Happiness Score | Happiness Score | Happiness Score |
| Australia and New Zealand | 7.302500 | 7.284 | 7.334 | 0.020936 | 2.000000 |
| Central and Eastern Europe | 5.371184 | 4.096 | 6.609 | 0.578274 | 29.000000 |
| Eastern Asia | 5.632333 | 4.874 | 6.422 | 0.502100 | 6.000000 |
| Latin America and Caribbean | 6.069074 | 3.603 | 7.226 | 0.728157 | 22.666667 |
| Middle East and Northern Africa | 5.387879 | 3.006 | 7.278 | 1.031656 | 19.333333 |
| North America | 7.227167 | 6.993 | 7.427 | 0.179331 | 2.000000 |
| Southeastern Asia | 5.364077 | 3.819 | 6.798 | 0.882637 | 8.666667 |
| Southern Asia | 4.590857 | 3.360 | 5.269 | 0.535978 | 7.000000 |
| Sub-Saharan Africa | 4.150957 | 2.693 | 5.648 | 0.584945 | 39.000000 |
| Western Europe | 6.693000 | 4.857 | 7.587 | 0.777886 | 21.000000 |

Both of the highest-ranking regions, which also have the lowest standard deviations, account for only two countries each. Sub-Saharan Africa, on the other hand, has the lowest happiness score, but it accounts for an average of 39 countries per year. An interesting next step would be to remove extreme values from the calculation to see if the ranking changes significantly. Let’s create a function that only uses the values that lie between the 0.25 and 0.75 quantiles. We’ll use this function to calculate the average for each region and check whether the ranking stays the same.
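The body of remove_outliers is not shown in the post, so the following is only one plausible sketch: it keeps the values between the 0.25 and 0.75 quantiles (inclusive, which is an assumption) and returns their mean. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

def remove_outliers(values):
    """Mean of the values lying between the 0.25 and 0.75 quantiles (inclusive)."""
    lower, upper = values.quantile([0.25, 0.75])
    middle = values[(values >= lower) & (values <= upper)]
    return middle.mean()

# Hypothetical data: "North" contains two extreme scores (1.0 and 20.0).
toy = pd.DataFrame({
    "Region": ["North"] * 5 + ["South"] * 5,
    "Happiness Score": [1.0, 6.0, 7.0, 8.0, 20.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

# The custom function sits next to a built-in aggregation in the same list.
trimmed = pd.pivot_table(
    toy,
    index="Region",
    values="Happiness Score",
    aggfunc=["mean", remove_outliers],
)
print(trimmed)
```

In this toy frame the plain mean for "North" is pulled up by the value 20.0, while the trimmed mean only averages 6.0, 7.0 and 8.0. Missing values fail both comparisons, so they are dropped by the filter as well.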

| | mean | remove_outliers | <lambda> |
|---|---|---|---|
| Region | Happiness Score | Happiness Score | Happiness Score |
| Australia and New Zealand | 7.302500 | 7.299125 | 2.000000 |
| Central and Eastern Europe | 5.371184 | 5.449250 | 29.000000 |
| Eastern Asia | 5.632333 | 5.610125 | 6.000000 |
| Latin America and Caribbean | 6.069074 | 6.192750 | 22.666667 |
| Middle East and Northern Africa | 5.387879 | 5.508500 | 19.333333 |
| North America | 7.227167 | 7.244875 | 2.000000 |
| Southeastern Asia | 5.364077 | 5.470125 | 8.666667 |
| Southern Asia | 4.590857 | 4.707500 | 7.000000 |
| Sub-Saharan Africa | 4.150957 | 4.128000 | 39.000000 |
| Western Europe | 6.693000 | 6.846500 | 21.000000 |

Removing the outliers mostly affected the regions with a higher number of countries, which makes sense. We can see Western Europe (average of 21 countries surveyed per year) improved its ranking. Unfortunately, Sub-Saharan Africa (average of 39 countries surveyed per year) received an even lower ranking when we removed the outliers.

Categorizing using string manipulation

Up until now we’ve grouped our data according to the categories in the original table. However, we can search the strings in the categories to create our own groups. For example, it would be interesting to look at the results by continent. We can do this by looking for region names that contain Asia, Europe, etc. To do this, we can first assign our pivot table to a variable, and then add our filter:

table = pd.pivot_table(data, index='Region', values="Happiness Score",
               aggfunc=[np.mean, remove_outliers])

table[table.index.str.contains('Asia')]

| | mean | remove_outliers |
|---|---|---|
| Region | Happiness Score | Happiness Score |
| Eastern Asia | 5.632333 | 5.610125 |
| Southeastern Asia | 5.364077 | 5.470125 |
| Southern Asia | 4.590857 | 4.707500 |
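The filter is an ordinary boolean mask built from the index labels, so the same pattern works for any substring. A self-contained sketch on a hypothetical three-row result (the scores are rounded placeholders, not the real output):

```python
import pandas as pd

# Hypothetical pivot-table result, indexed by region name.
table = pd.DataFrame(
    {"Happiness Score": [5.4, 6.7, 5.6]},
    index=pd.Index(
        ["Central and Eastern Europe", "Western Europe", "Eastern Asia"],
        name="Region",
    ),
)

# Boolean mask over the index: True where the region name contains "Europe".
is_europe = table.index.str.contains("Europe")
europe = table[is_europe]
print(europe)
```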

Let’s see the results for Europe:

| | mean | remove_outliers |
|---|---|---|
| Region | Happiness Score | Happiness Score |
| Central and Eastern Europe | 5.371184 | 5.44925 |
| Western Europe | 6.693000 | 6.84650 |

The results show that the two European regions are further apart in happiness score than the Asian regions are. In most cases, removing outliers makes the score higher, but not in Eastern Asia.

If you’d like to extract specific values from more than one column, then it’s better to use df.query because the previous method won’t work for conditioning multi-indexes. For example, we can choose to view specific years, and specific regions in the Africa area.

table = pd.pivot_table(data, index=['Region', 'Year'], values='Happiness Score',
               aggfunc=[np.mean, remove_outliers])

table.query('Year == [2015, 2017] and Region == ["Sub-Saharan Africa", "Middle East and Northern Africa"]')

| | | mean | remove_outliers |
|---|---|---|---|
| Region | Year | Happiness Score | Happiness Score |
| Middle East and Northern Africa | 2015 | 5.406900 | 5.515875 |
| | 2017 | 5.369684 | 5.425500 |
| Sub-Saharan Africa | 2015 | 4.202800 | 4.168375 |
| | 2017 | 4.111949 | 4.118000 |

In this example the differences are minor, but an interesting exercise would be to compare results from earlier years, since the report has been published since 2012.

Handling missing data

We’ve covered the most powerful parameters of pivot_table thus far, so you can already get a lot out of it if you experiment with this method on your own project. Having said that, it’s useful to quickly go through the remaining parameters (which are all optional and have default values). The first thing to talk about is missing values.

  • dropna is type boolean, and used to indicate you do not want to include columns whose entries are all NaN (default: True)
  • fill_value is type scalar, and used to choose a value to replace missing values (default: None).

We don’t have any columns where all entries are NaN, but it’s worth knowing that if we did, pivot_table would drop them by default, according to the dropna definition.

We have been letting pivot_table treat our NaNs according to the default settings. The fill_value default is None, which means we didn’t replace missing values in our data set. To demonstrate this we’ll need to produce a pivot table with NaN values. We can split the Happiness Score of each region into three quantiles, and check how many countries fall into each of the three quantiles (hoping at least one of the quantiles will have missing values in it).

To do this, we’ll use qcut(), which is a built-in pandas function that allows you to split your data into any number of quantiles you choose. For example, specifying pd.qcut(data["Happiness Score"], 4) will result in four quantiles:

  • 0-25%
  • 25%-50%
  • 50%-75%
  • 75%-100%
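To see the four bins concretely, here is a small sketch on a made-up series of eight scores (`scores` is hypothetical). Each value is labelled with the interval it falls into, and the `labels=False` variant returns the bin number instead.

```python
import pandas as pd

# Hypothetical scores, already sorted for readability.
scores = pd.Series([3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], name="Happiness Score")

# Four equal-sized bins; each label is the interval the value falls into.
quartile = pd.qcut(scores, 4)
print(quartile.value_counts().sort_index())

# labels=False returns the bin number (0 to 3) instead of the interval.
quartile_number = pd.qcut(scores, 4, labels=False)
print(quartile_number.tolist())
```

Because qcut cuts at quantiles rather than at evenly spaced values, each bin receives the same number of observations here, two per bin.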

| Region | Happiness Score (bin) | Happiness Score (count) |
|---|---|---|
| Australia and New Zealand | (2.692, 4.509] | NaN |
| | (4.509, 5.283] | NaN |
| | (5.283, 6.234] | NaN |
| | (6.234, 7.587] | 6.0 |
| Central and Eastern Europe | (2.692, 4.509] | 10.0 |
| | (4.509, 5.283] | 28.0 |
| | (5.283, 6.234] | 46.0 |
| | (6.234, 7.587] | 3.0 |
| Eastern Asia | (2.692, 4.509] | NaN |

Regions with no countries in a specific quantile show NaN. This isn’t ideal because a count that equals NaN doesn’t give us any useful information. It’s less confusing to display 0, so let’s replace NaN with zeros using fill_value:

# splitting the happiness score into 3 quantiles
score = pd.qcut(data["Happiness Score"], 3)
pd.pivot_table(data, index=['Region', score], values="Happiness Score", aggfunc='count',
              fill_value=0)

| Region | Happiness Score (bin) | Happiness Score (count) |
|---|---|---|
| Australia and New Zealand | (2.692, 4.79] | 0 |
| | (4.79, 5.895] | 0 |
| | (5.895, 7.587] | 6 |
| Central and Eastern Europe | (2.692, 4.79] | 15 |
| | (4.79, 5.895] | 58 |
| | (5.895, 7.587] | 14 |
| Eastern Asia | (2.692, 4.79] | 0 |
| | (4.79, 5.895] | 11 |
| | (5.895, 7.587] | 7 |
| Latin America and Caribbean | (2.692, 4.79] | 4 |
| | (4.79, 5.895] | 19 |
| | (5.895, 7.587] | 45 |
| Middle East and Northern Africa | (2.692, 4.79] | 18 |
| | (4.79, 5.895] | 20 |
| | (5.895, 7.587] | 20 |
| North America | (2.692, 4.79] | 0 |
| | (4.79, 5.895] | 0 |
| | (5.895, 7.587] | 6 |
| Southeastern Asia | (2.692, 4.79] | 6 |
| | (4.79, 5.895] | 12 |
| | (5.895, 7.587] | 8 |
| Southern Asia | (2.692, 4.79] | 13 |
| | (4.79, 5.895] | 8 |
| | (5.895, 7.587] | 0 |
| Sub-Saharan Africa | (2.692, 4.79] | 101 |
| | (4.79, 5.895] | 16 |
| | (5.895, 7.587] | 0 |
| Western Europe | (2.692, 4.79] | 0 |
| | (4.79, 5.895] | 12 |
| | (5.895, 7.587] | 51 |
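For a compact side-by-side view of what fill_value changes, the sketch below uses a made-up frame (`toy` is hypothetical) and a slightly different layout from the one in this post: Year is spread across the columns, so that a region with no row for a given year leaves an empty cell.

```python
import pandas as pd

# Hypothetical data: "South" has no 2016 row.
toy = pd.DataFrame({
    "Region": ["North", "North", "South"],
    "Year": [2015, 2016, 2015],
    "Happiness Score": [7.0, 7.2, 4.0],
})

# Without fill_value the empty (South, 2016) cell is NaN.
with_nan = pd.pivot_table(toy, index="Region", columns="Year",
                          values="Happiness Score", aggfunc="count")

# With fill_value=0 the same cell shows 0 instead.
with_zero = pd.pivot_table(toy, index="Region", columns="Year",
                           values="Happiness Score", aggfunc="count", fill_value=0)
print(with_nan)
print(with_zero)
```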

Adding total rows/columns

The last two parameters are both optional and mostly useful to improve display:

  • margins is type boolean and allows you to add an All row / column, e.g. for subtotals / grand totals (default: False)
  • margins_name is type string and accepts the name of the row / column that will contain the totals when margins is True (default: ‘All’)

Let’s use these to add a total to our last table.

| Region | Happiness Score (bin) | Happiness Score (count) |
|---|---|---|
| Australia and New Zealand | (2.692, 4.79] | 0.0 |
| | (4.79, 5.895] | 0.0 |
| | (5.895, 7.587] | 6.0 |
| Central and Eastern Europe | (2.692, 4.79] | 15.0 |
| | (4.79, 5.895] | 58.0 |
| | (5.895, 7.587] | 14.0 |
| Eastern Asia | (2.692, 4.79] | 0.0 |
| | (4.79, 5.895] | 11.0 |
| | (5.895, 7.587] | 7.0 |
| Latin America and Caribbean | (2.692, 4.79] | 4.0 |
| | (4.79, 5.895] | 19.0 |
| | (5.895, 7.587] | 45.0 |
| Middle East and Northern Africa | (2.692, 4.79] | 18.0 |
| | (4.79, 5.895] | 20.0 |
| | (5.895, 7.587] | 20.0 |
| North America | (2.692, 4.79] | 0.0 |
| | (4.79, 5.895] | 0.0 |
| | (5.895, 7.587] | 6.0 |
| Southeastern Asia | (2.692, 4.79] | 6.0 |
| | (4.79, 5.895] | 12.0 |
| | (5.895, 7.587] | 8.0 |
| Southern Asia | (2.692, 4.79] | 13.0 |
| | (4.79, 5.895] | 8.0 |
| | (5.895, 7.587] | 0.0 |
| Sub-Saharan Africa | (2.692, 4.79] | 101.0 |
| | (4.79, 5.895] | 16.0 |
| | (5.895, 7.587] | 0.0 |
| Western Europe | (2.692, 4.79] | 0.0 |
| | (4.79, 5.895] | 12.0 |
| | (5.895, 7.587] | 51.0 |
| Total count | | 470.0 |
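The call behind the table above is not shown in the post. A minimal sketch of margins and margins_name on a made-up frame (`toy` is hypothetical, and it groups by Region only to keep the example short) might look like this:

```python
import pandas as pd

# Hypothetical data: two scores in one region, three in another.
toy = pd.DataFrame({
    "Region": ["North", "North", "South", "South", "South"],
    "Happiness Score": [7.0, 7.2, 4.0, 4.4, 4.1],
})

# margins=True appends a totals row; margins_name sets its label.
counts = pd.pivot_table(
    toy,
    index="Region",
    values="Happiness Score",
    aggfunc="count",
    margins=True,
    margins_name="Total count",
)
print(counts)
```

The extra row applies the same aggregation to the whole column, so with a count it equals the number of non-missing scores in the data.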

Let’s summarize

Translated from: https://www.pybloggers.com/2017/09/explore-happiness-data-using-python-pivot-tables/
