Revised Translation of Python for Data Analysis, 2nd Edition: Chapter 12, Advanced pandas

The preceding chapters focused on different types of data wrangling workflows and on features of NumPy, pandas, and other libraries. Over time, pandas has developed a depth of features for power users. This chapter digs into a few more advanced feature areas to help you deepen your expertise as a pandas user.

12.1 Categorical Data

This section introduces the pandas Categorical type. I will show how you can achieve better performance and memory use in some pandas operations by using it. I also introduce some tools for using categorical data in statistics and machine learning applications.

12.1.1 Background and Motivation

Frequently, a column in a table contains repeated instances of a smaller set of distinct values. We have already seen functions like pandas.unique and pandas.value_counts, which extract the distinct values from an array and compute their frequencies, respectively:

import numpy as np
import pandas as pd
np.random.seed(12345)
import matplotlib.pyplot as plt
plt.rc('figure', figsize=(10, 6))
PREVIOUS_MAX_ROWS = pd.options.display.max_rows
pd.options.display.max_rows = 20
np.set_printoptions(precision=4, suppress=True)

In [10]: import numpy as np; import pandas as pd

In [11]: vals = pd.Series(['apple', 'orange', 'apple',
   ....:                     'apple'] * 2)  # translator's note (gg): renamed from `values` in the original to `vals` to avoid ambiguity

In [12]: vals
Out[12]: 
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
dtype: object

In [13]: pd.unique(vals)
Out[13]: array(['apple', 'orange'], dtype=object)

In [14]: pd.value_counts(vals)
Out[14]: 
apple     6
orange    2
dtype: int64

Many data systems (for data warehousing, statistical computing, or other uses) have developed specialized approaches for representing data with repeated values for more efficient storage and computation. In data warehousing, a best practice is to use so-called dimension tables containing the distinct values and to store the primary observations as integer keys referencing the dimension table:

In [15]: vals = pd.Series([0, 1, 0, 0] * 2)

In [16]: dim = pd.Series(['apple', 'orange'])

In [17]: vals
Out[17]: 
0    0
1    1
2    0
3    0
4    0
5    1
6    0
7    0
dtype: int64

In [18]: dim
Out[18]: 
0     apple
1    orange
dtype: object

We can use the take method to restore the original Series of strings:

In [19]: dim.take(vals)
Out[19]: 
0     apple
1    orange
0     apple
0     apple
0     apple
1    orange
0     apple
0     apple
dtype: object

This representation as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, dictionary, or levels of the data. In this book we will use the terms categorical and categories. The integer values that reference the categories are called the category codes or simply codes.
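The encode/decode round trip described above can be sketched with pandas.factorize, which returns the integer codes together with the distinct values. This is only a minimal illustration; the variable names (raw, codes, categories, decoded) are chosen here and do not come from the book.

```python
import pandas as pd

# Raw data: many repeats of a few distinct values.
raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns the integer codes plus the distinct values
# (the "dictionary" or categories), in order of first appearance.
codes, categories = pd.factorize(raw)

# Decoding is a positional lookup of every code in the categories.
decoded = pd.Series(categories.take(codes))

print(codes)             # integer codes, one per observation
print(list(categories))  # distinct values
```

Storing only codes plus a small array of categories is what makes the representation compact when the number of distinct values is small relative to the number of rows.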

The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:
• Renaming categories
• Appending a new category without changing the order or position of the existing categories
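Both transformations listed above can be checked directly on a small Categorical. The following is a minimal sketch, using the rename_categories and add_categories methods of pandas.Categorical; the fruit names are illustrative only.

```python
import pandas as pd

c = pd.Categorical(['apple', 'orange', 'apple', 'apple'])

# Renaming touches only the categories; the codes are unchanged.
renamed = c.rename_categories({'apple': 'APPLE', 'orange': 'ORANGE'})

# Appending puts the new category at the end, so existing codes stay valid.
extended = c.add_categories(['pear'])

print(list(c.codes), list(renamed.codes), list(extended.codes))
print(list(extended.categories))
```

Because neither operation has to rewrite the per-row codes, the cost depends on the number of categories rather than on the number of rows.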

12.1.2 Categorical Type in pandas

pandas has a special Categorical type for holding data that uses the integer-based categorical representation or encoding. Let's consider the example Series from before:

In [20]: fruits = ['apple', 'orange', 'apple', 'apple'] * 2

In [21]: N = len(fruits)

In [22]: df = pd.DataFrame({'fruit': fruits,
   ....:                    'basket_id': np.arange(N),
   ....:                    'count': np.random.randint(3, 15, size=N),
   ....:                    'weight': np.random.uniform(0, 4, size=N)},
   ....:                   columns=['basket_id', 'fruit', 'count', 'weight'])

In [23]: df
Out[23]: 
   basket_id   fruit  count    weight
0          0   apple      5  3.858058
1          1  orange      8  2.612708
2          2   apple      4  2.995627
3          3   apple      7  2.614279
4          4   apple     12  2.990859
5          5  orange      8  3.845227
6          6   apple      5  0.033553
7          7   apple      4  0.425778

Here, df['fruit'] is an array of Python string objects. We can convert it to categorical by calling:

In [24]: fruit_cat = df['fruit'].astype('category')

In [25]: fruit_cat
Out[25]: 
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]

The values for fruit_cat are not a NumPy array, but an instance of pandas.Categorical:

In [26]: c = fruit_cat.values

In [27]: type(c)
Out[27]: pandas.core.categorical.Categorical

The Categorical object has categories and codes attributes:

In [28]: c.categories
Out[28]: Index(['apple', 'orange'], dtype='object')

In [29]: c.codes
Out[29]: array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)

You can convert a DataFrame column to categorical by assigning the converted result:

In [30]: df['fruit'] = df['fruit'].astype('category')

In [31]: df.fruit
Out[31]:
0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: category
Categories (2, object): [apple, orange]
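When the set of categories (and possibly their order) is known ahead of time, astype also accepts a CategoricalDtype instead of the string 'category'. The sketch below assumes a hypothetical size column; the category names are made up for illustration.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

sizes = pd.Series(['small', 'large', 'medium', 'small'])

# Hypothetical category set, listed in the intended order.
size_type = CategoricalDtype(categories=['small', 'medium', 'large'],
                             ordered=True)
sizes_cat = sizes.astype(size_type)

print(sizes_cat.dtype)
print(list(sizes_cat.values.codes))
```

Fixing the dtype this way keeps the codes consistent across several columns or files that share the same category set.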

You can also create pandas.Categorical directly from other types of Python sequences:

In [32]: my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])

In [33]: my_categories
Out[33]: 
[foo, bar, baz, foo, bar]
Categories (3, object): [bar, baz, foo]

If you have obtained categorical encoded data from another source, you can use the alternative pandas.Categorical.from_codes constructor:

In [34]: ca = ['foo', 'bar', 'baz']  # translator's note (gg): renamed from `categories` in the original to `ca` to avoid ambiguity

In [35]: co = [0, 1, 2, 0, 0, 1]  # translator's note (gg): renamed from `codes` in the original to `co` to avoid ambiguity

In [36]: my_cats_2 = pd.Categorical.from_codes(codes=co, categories=ca)

In [37]: my_cats_2
Out[37]: 
[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo, bar, baz]
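Externally encoded data often carries a marker for missing observations. In from_codes, the code -1 is interpreted as a missing value; the sketch below assumes the upstream source uses that same convention.

```python
import pandas as pd

# Assumption: the upstream source encodes "missing" as -1.
cat = pd.Categorical.from_codes(codes=[0, -1, 1, 0],
                                categories=['foo', 'bar'])

print(cat)
print(pd.isna(cat))
```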

Unless explicitly specified, categorical conversions assume no specific ordering of the categories. So the categories array may be in a different order depending on the ordering of the input data. When using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering:

In [38]: ordered_cat = pd.Categorical.from_codes(codes=co, categories=ca,
   ....:                                         ordered=True)

In [39]: ordered_cat
Out[39]: 
[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]

The output [foo < bar < baz] indicates that 'foo' precedes 'bar' in the ordering, and so on. An unordered categorical instance can be made ordered with as_ordered:

In [40]: my_cats_2.as_ordered()
Out[40]: 
[foo, bar, baz, foo, foo, bar]
Categories (3, object): [foo < bar < baz]
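To see what a meaningful ordering buys, the following sketch builds an ordered categorical with explicitly listed categories and then sorts, takes the minimum and maximum, and compares against a category. The grade labels are hypothetical.

```python
import pandas as pd

grades = pd.Series(pd.Categorical(
    ['medium', 'low', 'high', 'low'],
    categories=['low', 'medium', 'high'],  # explicit, non-alphabetical order
    ordered=True,
))

# Sorting follows the category order rather than alphabetical order.
sorted_grades = grades.sort_values()

# min/max and comparisons are defined because ordered=True.
lowest, highest = grades.min(), grades.max()
above_low = grades > 'low'

print(sorted_grades.tolist(), lowest, highest, above_low.tolist())
```

On an unordered categorical, the same min, max, and greater-than operations raise a TypeError.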

As a last note, categorical data need not be strings, even though I have only shown string examples. A categorical array can consist of any immutable value types.
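As a small illustration of non-string categories, here is a sketch with integer values; the year values are made up.

```python
import pandas as pd

years = pd.Categorical([2015, 2016, 2015, 2017])

print(list(years.categories))  # distinct integer values
print(list(years.codes))       # position of each observation's category
```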

12.1.3 Computations with Categoricals

A Categorical in pandas generally behaves the same way as the non-encoded version (such as an array of strings). Some parts of pandas, like the groupby method, perform better when working with categoricals. There are also some functions that can utilize the ordered flag.

Let's consider some random numeric data and use the pandas.qcut binning function, which returns a pandas.Categorical. We used pandas.cut earlier in the book but glossed over the details of how categoricals work:

In [41]: np.random.seed(12345)

In [42]: draws = np.random.randn(1000)

In [43]: draws[:5]
Out[43]: array([-0.2047,  0.4789, -0.5194, -0.5557,  1.9658])

Let's compute a quartile binning of this data and extract some statistics:

In [44]: bs = pd.qcut(draws, 4)  # translator's note (gg): renamed from `bins` in the original to `bs` to avoid ambiguity

In [45]: bs
Out[45]: 
[(-0.684, -0.0101], (-0.0101, 0.63], (-0.684, -0.0101], (-0.684, -0.0101], (0.63, 3.928], ..., (-0.0101, 0.63], (-0.684, -0.0101], (-2.95, -0.684], (-0.0101, 0.63], (0.63, 3.928]]
Length: 1000
Categories (4, interval[float64]): [(-2.95, -0.684] < (-0.684, -0.0101] < (-0.0101, 0.63] < (0.63, 3.928]]

The exact sample quartiles, while informative, may be less useful for producing a report than quartile names. We can achieve this with the labels argument to qcut:

In [46]: bs = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])  # translator's note (gg): renamed from `bins` in the original to `bs` to avoid ambiguity

In [47]: bs
Out[47]: 
[Q2, Q3, Q2, Q2, Q4, ..., Q3, Q2, Q1, Q3, Q4]
Length: 1000
Categories (4, object): [Q1 < Q2 < Q3 < Q4]

In [48]: bs.codes[:10]
Out[48]: array([1, 2, 1, 1, 3, 3, 2, 2, 3, 3], dtype=int8)

The labeled bs categorical does not contain information about the bin edges in the data, so we can use groupby to extract some summary statistics:

In [49]: bs_s = pd.Series(bs, name='quartile')  # translator's note (gg): renamed from `bins` in the original to `bs_s` to avoid ambiguity

In [50]: results = (pd.Series(draws)
   ....:            .groupby(bs_s)
   ....:            .agg(['count', 'min', 'max'])
   ....:            .reset_index())

In [51]: results
Out[51]: 
  quartile  count       min       max
0       Q1    250 -2.949343 -0.685484
1       Q2    250 -0.683066 -0.010115
2       Q3    250 -0.010032  0.628894
3       Q4    250  0.634238  3.927528

The 'quartile' column in the result retains the original categorical information, including ordering, from bs:

In [52]: results['quartile']
Out[52]:
0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): [Q1 < Q2 < Q3 < Q4]
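Since labeled bins drop the edge information, an alternative to recovering it with groupby is to ask qcut for the edges directly through its retbins argument. This is a sketch of that alternative rather than the approach used above; it draws its own random data with NumPy's default_rng, so the numbers differ from the session shown earlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12345)
draws = rng.standard_normal(1000)

# retbins=True returns the labeled categorical together with the bin edges.
quartiles, edges = pd.qcut(draws, 4,
                           labels=['Q1', 'Q2', 'Q3', 'Q4'],
                           retbins=True)

print(quartiles[:5])
print(edges)  # five edges delimit the four quartile bins
```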

12.1.3.1 Better performance with categoricals

If you do a lot of analytics on a particular dataset, converting to categorical can yield substantial overall performance gains. A categorical version of a DataFrame column will often use significantly less memory, too. Let's consider a Series with 10 million elements and a small number of distinct categories:

In [53]: N = 10000000

In [54]: draws = pd.Series(np.random.randn(N))

In [55]: lbs = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))  # translator's note (gg): renamed from `labels` in the original to `lbs` to avoid ambiguity

Now we convert lbs to categorical:

In [56]: cat_s = lbs.astype('category')  # translator's note (gg): renamed from `categories` in the original to `cat_s` to avoid ambiguity

Now we note that lbs uses significantly more memory than cat_s:

In [57]: lbs.memory_usage()
Out[57]: 80000080

In [58]: cat_s.memory_usage()
Out[58]: 10000272
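For an object-dtype Series, memory_usage() without arguments counts only the array of references, not the string objects they point to. Passing deep=True also inspects the referenced objects. The sketch below uses a smaller, hypothetical N so that it runs quickly; exact byte counts depend on the pandas version and platform, so none are shown.

```python
import pandas as pd

n = 100_000
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
labels_cat = labels.astype('category')

shallow = labels.memory_usage()            # may exclude the string payloads
deep = labels.memory_usage(deep=True)      # includes the string payloads
cat_deep = labels_cat.memory_usage(deep=True)

print(shallow, deep, cat_deep)
```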

The conversion to category is not free, of course, but it is a one-time cost:

In [59]: %time _ = lbs.astype('category')
CPU times: user 490 ms, sys: 240 ms, total: 730 ms
Wall time: 726 ms

GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
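One way to check this claim on your own machine is to time the same aggregation with string keys and with categorical keys. The sketch below is a rough single-run measurement with time.perf_counter, not a rigorous benchmark; the helper timed_group_mean and the data size are assumptions made for illustration, and timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

n = 400_000
rng = np.random.default_rng(0)
values = pd.Series(rng.standard_normal(n))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (n // 4))
labels_cat = labels.astype('category')


def timed_group_mean(keys):
    """Group `values` by `keys`, returning the group means and elapsed seconds."""
    start = time.perf_counter()
    result = values.groupby(keys, observed=True).mean()
    return result, time.perf_counter() - start


by_str, t_str = timed_group_mean(labels)
by_cat, t_cat = timed_group_mean(labels_cat)

print(f'string keys: {t_str:.4f}s, categorical keys: {t_cat:.4f}s')
```

Both calls compute the same group means; only the key representation differs.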

12.1.4 Categorical Methods

Series containing categorical data have several special methods similar to the Series.str specialized string methods. These also provide convenient access to the categories and codes. Consider the Series:

In [60]: s = pd.Series(['a', 'b', 'c', 'd'] * 2)

In [61]: cat_s = s.astype('category')

In [62]: cat_s
Out[62]: 
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): [a, b, c, d]

The special attribute cat provides access to categorical methods:

In [63]: cat_s.cat.codes
Out[63]: 
0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [64]: cat_s.cat.categories
Out[64]: Index(['a', 'b', 'c', 'd'], dtype='object')

Suppose that we know the actual set of categories for this data extends beyond the four values observed in the data. We can use the set_categories method to change them:

In [65]: actual_categories = ['a', 'b', 'c', 'd', 'e']

In [66]: cat_s2 = cat_s.cat.set_categories(actual_categories)

In [67]: cat_s2
Out[67]: 
0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): [a, b, c, d, e]

While it appears that the data is unchanged, the new categories will be reflected in operations that use them. For example, value_counts respects the categories, if present:

In [68]: cat_s.value_counts()
Out[68]: 
d    2
c    2
b    2
a    2
dtype: int64

In [69]: cat_s2.value_counts()
Out[69]: 
d    2
c    2
b    2
a    2
e    0
dtype: int64

In large datasets, categoricals are often used as a convenient tool for memory savings and better performance. After you filter a large DataFrame or Series, many of the categories may not appear in the data. To help with this, we can use the remove_unused_categories method to trim unobserved categories:

In [70]: cat_s3 = cat_s[cat_s.isin(['a', 'b'])]

In [71]: cat_s3
Out[71]: 
0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): [a, b, c, d]

In [72]: cat_s3.cat.remove_unused_categories()
Out[72]: 
0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): [a, b]

See Table 12-1 for a listing of available categorical methods.

Table 12-1. Categorical methods for Series in pandas
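As a minimal sketch, here are a few methods that the cat accessor provides, applied to a small hypothetical Series. Each call returns a new Series and leaves s unchanged.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# Append a category that does not occur in the data.
added = s.cat.add_categories(['d'])

# Remove a category; values that used it become missing.
removed = s.cat.remove_categories(['c'])

# Change the category order and mark the result as ordered.
reordered = s.cat.reorder_categories(['c', 'b', 'a'], ordered=True)

# Replace the category names positionally.
renamed = s.cat.rename_categories(['x', 'y', 'z'])

# Drop the ordering again.
unordered = reordered.cat.as_unordered()

print(list(added.cat.categories))
print(removed.tolist())
print(list(reordered.cat.codes))
print(renamed.tolist())
```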

12.1.4.1 Creating dummy variables for modeling

When you're using statistics or machine learning tools, you'll often transform categorical data into dummy variables, also known as one-hot encoding. This involves creating a DataFrame with a column for each distinct category; these columns contain 1s for occurrences of a given category and 0s otherwise.

Consider the previous example:

In [73]: cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

As mentioned previously in Chapter 7, the pandas.get_dummies function converts this one-dimensional categorical data into a DataFrame containing the dummy variables:

In [74]: pd.get_dummies(cat_s)
Out[74]: 
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
4  1  0  0  0
5  0  1  0  0
6  0  0  1  0
7  0  0  0  1
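Recent pandas versions return boolean columns from get_dummies by default, so getting the 1/0 output shown above may require passing dtype explicitly. The sketch below does that and also adds a column prefix; the prefix string 'category' is an arbitrary choice for illustration.

```python
import pandas as pd

cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# dtype=int requests 1/0 integers; prefix is prepended to each column name.
dummies = pd.get_dummies(cat_s, prefix='category', dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column of the category observed for that row.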
