Python 数据处理（二十八）—— MultiIndex 分层索引

前言

本节将介绍使用多级索引（分层索引）和其他高级索引技巧

分层索引（MultiIndex）

分层或多级次索引的存在是非常有意义的，因为它打开了复杂的数据分析和操作的大门，特别是处理高维数据

从本质上讲，它允许您在较低维度的数据结构包括 Series(1d) 和 DataFrame(2d) 中存储和操作任意维度的数据

在本节中，我们将展示分层索引的确切含义，以及如何结合前面介绍的所有 pandas 索引功能使用

在 0.24.0 版本之后，MultiIndex.label 重命名为 MultiIndex.codes。MultiIndex.set_labels 重命名为 MultiIndex.set_codes.

1 创建一个 MultiIndex（分级索引）对象

MultiIndex 对象是标准 Index 对象的分层模式，它通常在 pandas 对象中存储轴标签

你可以把 MultiIndex 看成一个元组数组，其中每个元组都是唯一的。MultiIndex 有如下创建方式

MultiIndex.from_arrays()：传入一个数组列表
MultiIndex.from_tuples()：传入一个元组数组、
MultiIndex.from_product()：传入一个交叉的迭代集合
MultiIndex.from_frame()：传入一个 DataFrame

当传递给 Index 构造函数一个元组列表时，它将尝试返回一个 MultiIndex。

下面的示例演示了初始化 MultiIndex 的不同方法。

In [1]: arrays = [
   ...:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...: ]
   ...: 

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]: 
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64

当你想要对两个可迭代对象中的每个元素进行两两配对时，可以使用 MultiIndex.from_product()

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

您也可以使用 MultiIndex.from_frame() 方法直接从 DataFrame 中构造一个 MultiIndex。

In [10]: df = pd.DataFrame(
   ....:     [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
   ....:     columns=["first", "second"],
   ....: )
   ....: 

In [11]: pd.MultiIndex.from_frame(df)
Out[11]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])

为了方便起见，你可以直接将数组列表传递给 Series 或 DataFrame 的 index 参数来自动构造一个 MultiIndex

In [12]: arrays = [
   ....:     np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
   ....:     np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
   ....: ]
   ....: 

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]: 
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]: 
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

所有的 MultiIndex 构造函数都接受一个 name 参数，该参数存储索引级别的名称。如果没有设置，则值为 None

In [17]: df.index.names
Out[17]: FrozenList([None, None])

索引可以放在任何轴上，索引的层级也可以随你设置

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]: 
first        bar                 baz                 foo                 qux          
second       one       two       one       two       one       two       one       two
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]: 
first              bar                 baz                 foo          
second             one       two       one       two       one       two
first second                                                            
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

这已经简化了较高层次的索引，使控制台的输出更容易看清。

注意，索引的显示方式可以通过 pandas.set_options() 中的 multi_sparse 选项来控制。

In [21]: with pd.option_context("display.multi_sparse", False):
   ....:     df
   ....:

值得注意的是，将元组用作轴上的原子标签也是可以的

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]: 
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64

多索引之所以重要，是因为它允许您执行分组、选择和重塑操作，我们将在下面以及后续部分中描述这些操作

2 重建级别标签

get_level_values() 方法能够返回特定级别的标签向量

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

3 使用 MultiIndex 在轴上进行基本索引

分级索引的一个重要特性是，您可以通过标识数据中的子组的部分标签来选择数据

部分选择以一种完全类似于在常规 DataFrame 中选择列的方式，返回的结果会 "降低" 分层索引的级别

In [25]: df["bar"]
Out[25]: 
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]: 
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]: 
one   -1.039575
two    0.271860
dtype: float64

4 定义级别

MultiIndex 会保留索引的所有已经定义了的级别，尽管它们实际上可能并没有被使用。

在对索引进行切片时，您可能会注意到这一点。例如

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo","qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

这样做的目的是为了避免重新计算级别，以提高切片的性能。如果你只想看某一级别，可以使用 get_level_values() 方法

In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]: 
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

可以使用 remove_unused_levels() 方法重构 MultiIndex

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])

5 数据对齐和 reindex

在具有 MultiIndex 的不同索引对象之间的操作会自动对齐

In [35]: s + s[:-2]
Out[35]: 
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]: 
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64

Series/DataFrames 的 reindex() 方法可以传入一个 MultiIndex，甚至可以是一个元组列表或元组数组

In [37]: s.reindex(index[:3])
Out[37]: 
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]: 
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64