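pandas Study Notes (3): Hierarchical Indexing

The reading experience here is poor (the markdown is parsed differently from Youdao Cloud Notes), so I recommend following the link instead

jupyter notebook: pandas Study Notes (3): Hierarchical Indexing

This series is the set of notes I took while studying the Python Data Science Handbook

They are for my own reference

I am sharing them along the way, so some parts may be imprecise or unclearly explained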

Multi-level index on a Series

import numpy as np
import pandas as pd

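What a multi-level index is for: representing higher-dimensional data with a lower-dimensional Series or DataFrame
First, pretend we do not know that pandas provides multi-level indexes, and build a Series dataset by hand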

index= {('California', 2000),('California',2010),
        ('New York',2000),('New York',2010),
        ('Texas',2000),('Texas',2010)}
populations = [33871648,37253956,
              18976457,19378102,
              20851820,25145561]
pop = pd.Series(populations, index = index)
pop

Texas       2000    33871648
New York    2000    37253956
            2010    18976457
California  2010    19378102
Texas       2010    20851820
California  2000    25145561
dtype: int64

Let's see what the index we set looks like

index

{('California', 2000),
 ('California', 2010),
 ('New York', 2000),
 ('New York', 2010),
 ('Texas', 2000),
 ('Texas', 2010)}

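This is a multi-level index made of plain tuples, which is awkward to work with
Also, why are the two California rows in pop not next to each other? It is driving me crazy!
The reason: index was written with braces, so it is a Python set, which has no guaranteed order. The rows come out shuffled and the populations get paired with the wrong labels, and the later outputs in these notes inherit that pairing. Writing the tuples in a list keeps the intended order.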

pandas MultiIndex
Now we use a Cartesian product to generate the multi-level index

index = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])  
index

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

Next, we reindex pop with it, and the hierarchical index shows up

pop = pop.reindex(index)
pop

California  2000    25145561
            2010    19378102
New York    2000    37253956
            2010    18976457
Texas       2000    33871648
            2010    20851820
dtype: int64

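Much nicer. The leftmost index is level 0, the years such as 2000 are level 1, and so on.
The object is still a Series

Now we can use the second index level directly to get all of the 2010 data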

pop[:,2010]  # [a,b]: a is the place name such as California, b is the year such as 2000

California    19378102
New York      18976457
Texas         20851820
dtype: int64
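For comparison, a small sketch of two other spellings of the same selection: the loc indexer and the xs cross-section method. The Series s and its values 0 to 5 are placeholders chosen only to make the result easy to read.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
# Placeholder values 0..5, not real population figures
s = pd.Series(range(6), index=index)

by_loc = s.loc[:, 2010]             # empty slice on level 0, label on level 1
by_xs = s.xs(2010, level='years')   # cross-section on a named level

print(by_loc)
print(by_xs)
```

Both results are indexed by state only, because the selected level is dropped.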

How to create a multi-level index

1. Creating a multi-level index explicitly

pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])  # build from simple arrays

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])  # build from tuples

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])  # build from a Cartesian product, as seen above

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

At a lower level, you can pass levels and labels directly

pd.MultiIndex(levels = [['a','b'],[1,2]],
             labels = [[0,0,1,1],[0,1,0,1]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

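levels holds two lists: the labels of the level-0 index and the labels of the level-1 index
labels also holds two lists of equal length (the number of elements in the dataset); they say which position in level 0 and in level 1 each element takes its label from. It helps to think of the Cartesian product.
Note: in pandas 0.24 and later this argument is named codes, and the labels keyword was later removed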
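A short sketch of how the two arguments combine, assuming pandas 0.24 or later, where the position lists are passed as codes. The variable manual is made up here to rebuild the tuples by hand for comparison.

```python
import pandas as pd

levels = [['a', 'b'], [1, 2]]
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]

# Assumes pandas >= 0.24, where this keyword is called codes
mi = pd.MultiIndex(levels=levels, codes=codes)

# Rebuild the same tuples by hand: each code is a position in its level
manual = [(levels[0][i], levels[1][j]) for i, j in zip(*codes)]

print(list(mi))
print(manual)
```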

Naming the index levels makes them easier to manage

pop.index.names = ['states','years']
pop

states      years
California  2000     25145561
            2010     19378102
New York    2000     37253956
            2010     18976457
Texas       2000     33871648
            2010     20851820
dtype: int64

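2. Multi-level column index

A DataFrame can have a multi-level column index just as it can have a multi-level row index
Below we mock up a DataFrame of medical data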

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

data = np.round(np.random.randn(4,6), 1)
data[:,::2] *= 10
data += 37
data

array([[30. , 38.4, 43. , 35.8, 34. , 36. ],
       [35. , 36. , 31. , 36.6, 24. , 36.7],
       [31. , 36.6, 47. , 37.2, 37. , 39. ],
       [39. , 35.7, 40. , 36.4, 37. , 36.9]])

health_data = pd.DataFrame(data, index = index, columns = columns)
health_data

	subject	Bob		Guido		Sue
	type	HR	Temp	HR	Temp	HR	Temp
year	visit
2013	1	30.0	38.4	43.0	35.8	34.0	36.0
2013	2	35.0	36.0	31.0	36.6	24.0	36.7
2014	1	31.0	36.6	47.0	37.2	37.0	39.0
2014	2	39.0	35.7	40.0	36.4	37.0	36.9

Passing a single key to a DataFrame can only look up the level-0 column index

health_data['Guido']  # health_data['HR'] would raise an error

	type	HR	Temp
year	visit
2013	1	43.0	35.8
2013	2	31.0	36.6
2014	1	47.0	37.2
2014	2	40.0	36.4
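If you do want every HR column, one option is the xs method with axis=1. This is a sketch on a stand-in frame: df and its arange values are hypothetical, only the index and column layout follow health_data.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Stand-in values 0..23 so the selected columns are easy to recognise
df = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                  index=index, columns=columns)

# Cross-section on the level-1 column labels; the 'type' level is dropped
hr = df.xs('HR', axis=1, level='type')
print(hr)
```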

Selecting values with a multi-level index

1. Series with a multi-level index

Using the pop dataset as the example

pop

states      years
California  2000     25145561
            2010     19378102
New York    2000     37253956
            2010     18976457
Texas       2000     33871648
            2010     20851820
dtype: int64

pop['California',2000]  # mind the order of the index levels

25145561

pop['California']  # a single key without a comma can only select from level 0; pop[2010] raises an error

years
2000    25145561
2010    19378102
dtype: int64

pop.loc['California':'New York']  # slicing also works, but the level-0 index must be sorted (A-Z)
# use pop = pop.sort_index() to sort the index

states      years
California  2000     25145561
            2010     19378102
New York    2000     37253956
            2010     18976457
dtype: int64
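A sketch of why the sorting matters, using a deliberately unsorted toy Series (unsorted, s_sorted and slice_failed are made-up names). At the time of writing, pandas reports this case with UnsortedIndexError, which is a subclass of KeyError; treat the exact exception as an assumption about your version.

```python
import pandas as pd

# Level-0 labels deliberately out of alphabetical order
index = pd.MultiIndex.from_product([['c', 'a', 'b'], [1, 2]])
unsorted = pd.Series(range(6), index=index)

try:
    unsorted.loc['a':'b']
    slice_failed = False
except KeyError:
    # Expected on an unsorted MultiIndex (UnsortedIndexError subclasses KeyError)
    slice_failed = True

# After sorting, the same label slice works
s_sorted = unsorted.sort_index()
sliced = s_sorted.loc['a':'b']

print(slice_failed)
print(sliced)
```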

If the index is sorted, you can select on a lower level by passing an empty slice for level 0

pop[:,2010]

states
California    19378102
New York      18976457
Texas         20851820
dtype: int64

Boolean masks and fancy indexing work too; I won't go into them here
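For reference, a brief sketch of both. It rebuilds the population Series from lists so the values line up with their labels; big and two_states are names made up for this example.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Boolean mask: keep entries above 22 million
big = pop[pop > 22000000]

# Fancy indexing: pass a list of level-0 labels
two_states = pop.loc[['California', 'Texas']]

print(big)
print(two_states)
```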

2. DataFrame with a multi-level index

Using the health_data dataset as the example

health_data

	subject	Bob		Guido		Sue
	type	HR	Temp	HR	Temp	HR	Temp
year	visit
2013	1	30.0	38.4	43.0	35.8	34.0	36.0
2013	2	35.0	36.0	31.0	36.6	24.0	36.7
2014	1	31.0	36.6	47.0	37.2	37.0	39.0
2014	2	39.0	35.7	40.0	36.4	37.0	36.9

Basic indexing on a DataFrame is column indexing; without loc or iloc you can only index columns

health_data['Guido','HR']

year  visit
2013  1        43.0
      2        31.0
2014  1        47.0
      2        40.0
Name: (Guido, HR), dtype: float64

With the DataFrame indexers you can index both rows and columns

health_data.iloc[0:2, 0:2]

	subject	Bob
	type	HR	Temp
year	visit
2013	1	30.0	38.4
2013	2	35.0	36.0

health_data.loc[:,(('Bob','Guido'), 'HR')]  # study this example carefully

	subject	Bob	Guido
	type	HR	HR
year	visit
2013	1	30.0	43.0
2013	2	35.0	31.0
2014	1	31.0	47.0
2014	2	39.0	40.0

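With the loc indexer you can index rows and columns, and also index several levels of either; the call above is a good example

health_data.loc[ , ]: rows go to the left of the comma, columns to the right
health_data.loc[:, ((level-0 column labels), (level-1 column labels))]
To index several levels at once you must use this nested-tuple form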

This tuple form is not very convenient: using a slice inside the tuple is a syntax error

health_data.loc[:,(:, 'HR')]

  File "", line 1
    health_data.loc[:,(:, 'HR')]
                       ^
SyntaxError: invalid syntax
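One way around the syntax error is pd.IndexSlice, which lets slice notation appear inside the tuple. The sketch below runs on a stand-in frame: df and its arange values are hypothetical, only the layout follows health_data.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Stand-in values 0..23 so the selected cells are easy to recognise
df = pd.DataFrame(np.arange(24, dtype=float).reshape(4, 6),
                  index=index, columns=columns)

idx = pd.IndexSlice

# Every subject, HR only
hr_all = df.loc[:, idx[:, 'HR']]

# First visit of every year, HR only
first_visits = df.loc[idx[:, 1], idx[:, 'HR']]

print(hr_all)
print(first_visits)
```

Writing slice(None) in place of the colon, as in df.loc[:, (slice(None), 'HR')], is an equivalent spelling.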

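3. Setting and resetting the index

Setting and resetting the index converts data between long and wide forms

Resetting the index
Here it is applied to a Series object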

help(pop.reset_index)

Help on method reset_index in module pandas.core.series:

reset_index(level=None, drop=False, name=None, inplace=False) method of pandas.core.series.Series instance
    Generate a new DataFrame or Series with the index reset.
    
    This is useful when the index needs to be treated as a column, or
    when the index is meaningless and needs to be reset to the default
    before another operation.
    
    Parameters
    ----------
    level : int, str, tuple, or list, default optional
        For a Series with a MultiIndex, only remove the specified levels
        from the index. Removes all levels by default.
    drop : bool, default False
        Just reset the index, without inserting it as a column in
        the new DataFrame.
    name : object, optional
        The name to use for the column containing the original Series
        values. Uses ``self.name`` by default. This argument is ignored
        when `drop` is True.
    inplace : bool, default False
        Modify the Series in place (do not create a new object).
    
    Returns
    -------
    Series or DataFrame
        When `drop` is False (the default), a DataFrame is returned.
        The newly created columns will come first in the DataFrame,
        followed by the original Series values.
        When `drop` is True, a `Series` is returned.
        In either case, if ``inplace=True``, no value is returned.
    
    See Also
    --------
    DataFrame.reset_index: Analogous function for DataFrame.
    
    Examples
    --------
    
    >>> s = pd.Series([1, 2, 3, 4], name='foo',
    ...               index=pd.Index(['a', 'b', 'c', 'd'], name='idx'))
    
    Generate a DataFrame with default index.
    
    >>> s.reset_index()
      idx  foo
    0   a    1
    1   b    2
    2   c    3
    3   d    4
    
    To specify the name of the new column use `name`.
    
    >>> s.reset_index(name='values')
      idx  values
    0   a       1
    1   b       2
    2   c       3
    3   d       4
    
    To generate a new Series with the default set `drop` to True.
    
    >>> s.reset_index(drop=True)
    0    1
    1    2
    2    3
    3    4
    Name: foo, dtype: int64
    
    To update the Series in place, without generating a new one
    set `inplace` to True. Note that it also requires ``drop=True``.
    
    >>> s.reset_index(inplace=True, drop=True)
    >>> s
    0    1
    1    2
    2    3
    3    4
    Name: foo, dtype: int64
    
    The `level` parameter is interesting for Series with a multi-level
    index.
    
    >>> arrays = [np.array(['bar', 'bar', 'baz', 'baz']),
    ...           np.array(['one', 'two', 'one', 'two'])]
    >>> s2 = pd.Series(
    ...     range(4), name='foo',
    ...     index=pd.MultiIndex.from_arrays(arrays,
    ...                                     names=['a', 'b']))
    
    To remove a specific level from the Index, use `level`.
    
    >>> s2.reset_index(level='a')
           a  foo
    b
    one  bar    0
    two  bar    1
    one  baz    2
    two  baz    3
    
    If `level` is not set, all levels are removed from the Index.
    
    >>> s2.reset_index()
         a    b  foo
    0  bar  one    0
    1  bar  two    1
    2  baz  one    2
    3  baz  two    3

pop_flat = pop.reset_index()  # without the name argument, the value column gets a default name (0 here)
pop_flat

	states	years	0
0	California	2000	25145561
1	California	2010	19378102
2	New York	2000	37253956
3	New York	2010	18976457
4	Texas	2000	33871648
5	Texas	2010	20851820

pop_flat2 = pop.reset_index(name = 'population')  # the name argument sets the name of the value column
pop_flat2

	states	years	population
0	California	2000	25145561
1	California	2010	19378102
2	New York	2000	37253956
3	New York	2010	18976457
4	Texas	2000	33871648
5	Texas	2010	20851820

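Setting the index
This is the inverse of the operation above
Using pop_flat2 as the example, it turns the plain DataFrame above into a DataFrame with a multi-level index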

pop_flat2.set_index(['states', 'years'])  # returns a DataFrame

		population
states	years
California	2000	25145561
California	2010	19378102
New York	2000	37253956
New York	2010	18976457
Texas	2000	33871648
Texas	2010	20851820

pop_flat2.set_index( 'years')  # returns a DataFrame

	states	population
years
2000	California	25145561
2010	California	19378102
2000	New York	37253956
2010	New York	18976457
2000	Texas	33871648
2010	Texas	20851820
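To check that the two operations really are inverses, here is a small round-trip sketch. It rebuilds the Series from lists; flat and restored are names made up for this example.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Series -> flat DataFrame with columns states, years, population
flat = pop.reset_index(name='population')

# Flat DataFrame -> multi-level index again, then pull the column back out
restored = flat.set_index(['states', 'years'])['population']

print(flat)
print(restored)
```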

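stack and unstack

Using the pop dataset as the example
I find stack and unstack very handy for switching dimensions: they reshape a dataset between long and wide forms to suit different needs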

pop

states      years
California  2000     25145561
            2010     19378102
New York    2000     37253956
            2010     18976457
Texas       2000     33871648
            2010     20851820
dtype: int64

Use unstack to turn the years level (level 1) into column names

pop.unstack(level = 1)

years	2000	2010
states
California	25145561	19378102
New York	37253956	18976457
Texas	33871648	20851820
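A sketch of the round trip, rebuilt from lists so the values line up with their labels: unstack on either level, then stack to return to the long Series. The names wide, by_state and long_again are made up here.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

wide = pop.unstack(level='years')       # years become the columns
by_state = pop.unstack(level='states')  # states become the columns

# stack moves the columns back into the innermost index level
long_again = wide.stack()

print(wide)
print(by_state)
print(long_again)
```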