pandas study notes (3): hierarchical indexing


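This series collects my notes from studying the Python Data Science Handbook.

They are mainly a personal reference.

I am sharing them along the way, so some passages may be imprecise or unclear.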

MultiIndex on a Series

import numpy as np
import pandas as pd

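What a MultiIndex is for: representing higher-dimensional data in a lower-dimensional Series or DataFrame.
First, pretend pandas had no MultiIndex and build a Series keyed by plain tuples.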

index= {('California', 2000),('California',2010),
        ('New York',2000),('New York',2010),
        ('Texas',2000),('Texas',2010)}
populations = [33871648,37253956,
              18976457,19378102,
              20851820,25145561]
pop = pd.Series(populations, index = index)
pop
Texas       2000    33871648
New York    2000    37253956
            2010    18976457
California  2010    19378102
Texas       2010    20851820
California  2000    25145561
dtype: int64

Take a look at the index we just defined:

index

{('California', 2000),
 ('California', 2010),
 ('New York', 2000),
 ('New York', 2010),
 ('Texas', 2000),
 ('Texas', 2010)}

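This is a multi-level index made of plain tuples, which is awkward to work with.
Also, the two California rows in pop above are not even next to each other, which bothers me no end! The cause is the curly braces: index was written as a set, and a set has no defined order, so the labels were paired with the values in arbitrary order (33871648 was meant for California 2000 but landed on Texas 2000). Writing index as a list, with square brackets, keeps the intended order.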
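The scrambled pairing can be avoided by keeping the tuples in a list. A minimal sketch, assuming the only change is the bracket type; the names index_list and pop_ordered are mine and do not appear in the notes, and the later outputs in this post still reflect the original pairing:

```python
import pandas as pd

# Same tuples as above, but in a list so their order is preserved
index_list = [('California', 2000), ('California', 2010),
              ('New York', 2000), ('New York', 2010),
              ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

# Each value is paired with the tuple at the same position
pop_ordered = pd.Series(populations, index=index_list)
print(pop_ordered)
```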

  • pandas MultiIndex
    Now build the MultiIndex from a Cartesian product
index = pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])  
index
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
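To see what "Cartesian product" means here, the same pairs can be generated with the standard library and compared. This is only an illustrative sketch; the variable names states, years and pairs are my own:

```python
import itertools
import pandas as pd

states = ['California', 'New York', 'Texas']
years = [2000, 2010]

# Every state combined with every year, in order
index = pd.MultiIndex.from_product([states, years])

# The same pairs produced by itertools.product
pairs = list(itertools.product(states, years))

print(list(index) == pairs)
```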

Next, reindex pop with this MultiIndex and the hierarchical index shows up.

pop = pop.reindex(index)
pop
California  2000    25145561
            2010    19378102
New York    2000    37253956
            2010    18976457
Texas       2000    33871648
            2010    20851820
dtype: int64

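Much nicer. The leftmost index is level 0, the years (2000, 2010) are level 1, and so on.
The object is still a Series.

Now the second level can be used directly to pull out all the 2010 data.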

pop[:,2010]  # [a, b]: a selects on level 0 (the state names), b on level 1 (the years)
California    19378102
New York      18976457
Texas         20851820
dtype: int64
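The same selection can also be written with Series.xs, which takes the level explicitly. A small sketch, assuming the values printed in the reindexed output above; by_slice and by_xs are names I made up for the comparison:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
# Values copied from the reindexed output shown above
pop = pd.Series([25145561, 19378102, 37253956,
                 18976457, 33871648, 20851820], index=index)

by_slice = pop[:, 2010]          # empty slice on level 0, label on level 1
by_xs = pop.xs(2010, level=1)    # cross-section taken on level 1

print(by_xs)
```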

Ways to create a MultiIndex

1. Creating a MultiIndex explicitly

pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])  # from plain arrays
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])  # from a list of tuples
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
pd.MultiIndex.from_product([['California','New York','Texas'],[2000,2010]])  # from a Cartesian product, as seen earlier
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

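For full control, pass levels and codes directly (codes was called labels before pandas 0.24, which is why the output recorded in these notes prints labels=)

pd.MultiIndex(levels = [['a','b'],[1,2]],
             codes = [[0,0,1,1],[0,1,0,1]])
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

levels holds two lists: the distinct labels of level 0 and of level 1.
codes (labels in the older output) also holds two lists, each as long as the data set. For every row they give the position, within the matching levels list, of the label that row uses. Compare this with the Cartesian product above.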
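The lookup from positions to labels can be spelled out by hand. A minimal sketch, assuming pandas 0.24 or later where the constructor parameter is named codes; the list rebuilt is my own helper for illustration:

```python
import pandas as pd

levels = [['a', 'b'], [1, 2]]
codes = [[0, 0, 1, 1], [0, 1, 0, 1]]

# codes= is the parameter name in pandas 0.24 and later (older releases used labels=)
mi = pd.MultiIndex(levels=levels, codes=codes)

# Rebuild each row's labels manually: a code is a position in the matching levels list
rebuilt = [(levels[0][i], levels[1][j]) for i, j in zip(codes[0], codes[1])]

print(rebuilt)
```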

Naming the levels makes the index easier to manage.

pop.index.names = ['states','years']
pop
states      years
California  2000     25145561
            2010     19378102
New York    2000     37253956
            2010     18976457
Texas       2000     33871648
            2010     20851820
dtype: int64
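Once the levels have names, those names can be used wherever a level is expected. A short sketch of two such uses, assuming the values printed above; years_only and max_by_state are illustrative names of mine:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
# Values copied from the output shown above
pop = pd.Series([25145561, 19378102, 37253956,
                 18976457, 33871648, 20851820], index=index)

# Pull out one level's labels by name
years_only = pop.index.get_level_values('years')

# Group on a level by name instead of by position
max_by_state = pop.groupby(level='states').max()

print(max_by_state)
```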

2. MultiIndex on columns

In a DataFrame, columns can carry a MultiIndex just as rows can.
The following builds a DataFrame of mock medical data.

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
data = np.round(np.random.randn(4,6), 1)
data[:,::2] *= 10
data += 37
data
array([[30. , 38.4, 43. , 35.8, 34. , 36. ],
       [35. , 36. , 31. , 36.6, 24. , 36.7],
       [31. , 36.6, 47. , 37.2, 37. , 39. ],
       [39. , 35.7, 40. , 36.4, 37. , 36.9]])
health_data = pd.DataFrame(data, index = index, columns = columns)
health_data
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      30.0  38.4  43.0  35.8  34.0  36.0
     2      35.0  36.0  31.0  36.6  24.0  36.7
2014 1      31.0  36.6  47.0  37.2  37.0  39.0
     2      39.0  35.7  40.0  36.4  37.0  36.9

  • Indexing a DataFrame with a single key only looks in level 0 of the columns
health_data['Guido']  # health_data['HR'] would raise an error
type         HR  Temp
year visit
2013 1     43.0  35.8
     2     31.0  36.6
2014 1     47.0  37.2
     2     40.0  36.4
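To reach a level-1 column label such as 'HR' across all subjects, one option is DataFrame.xs with axis=1. A sketch, assuming the fixed values printed in the array above (the original data is random); hr_all is a name I chose:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Fixed values copied from the array printed above
data = np.array([[30., 38.4, 43., 35.8, 34., 36.],
                 [35., 36., 31., 36.6, 24., 36.7],
                 [31., 36.6, 47., 37.2, 37., 39.],
                 [39., 35.7, 40., 36.4, 37., 36.9]])
health_data = pd.DataFrame(data, index=index, columns=columns)

# Cross-section on the column level named 'type'
hr_all = health_data.xs('HR', axis=1, level='type')
print(hr_all)
```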

Selecting from a MultiIndex

1. MultiIndex on a Series

Using the pop data set as the example

pop
states      years
California  2000     25145561
            2010     19378102
New York    2000     37253956
            2010     18976457
Texas       2000     33871648
            2010     20851820
dtype: int64
pop['California',2000]  # mind the order of the levels
25145561
pop['California']  # a single key with no comma is only looked up in level 0; pop[2010] raises an error
years
2000    25145561
2010    19378102
dtype: int64
pop.loc['California':'New York']  # slicing works too, but level 0 must be sorted (A-Z)
# pop = pop.sort_index() sorts the index
states      years
California  2000     25145561
            2010     19378102
New York    2000     37253956
            2010     18976457
dtype: int64
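The sorting requirement is easy to check on a small Series. A hedged sketch with made-up labels: on the pandas versions I am aware of, slicing the unsorted index raises pandas.errors.UnsortedIndexError (a KeyError subclass), but the sketch only relies on the sorted case working:

```python
import pandas as pd

# Level-0 labels deliberately out of order: a, c, b
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(range(6), index=index)

print(data.index.is_monotonic_increasing)  # the index is not sorted

try:
    data.loc['a':'b']
    slicing_failed = False
except KeyError as err:  # UnsortedIndexError derives from KeyError
    slicing_failed = True
    print(type(err).__name__, err)

# After sorting, label slicing on level 0 works
sorted_data = data.sort_index()
print(sorted_data.loc['a':'b'])
```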
  • If the index is sorted, a lower level can be selected by passing an empty slice for level 0
pop[:,2010]
states
California    19378102
New York      18976457
Texas         20851820
dtype: int64

Boolean masks and fancy indexing work as well; I won't go into them here.
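For reference, a brief sketch of both, assuming the pop values printed above; the threshold 22000000 and the names large and two_states are arbitrary choices of mine:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
# Values copied from the output shown above
pop = pd.Series([25145561, 19378102, 37253956,
                 18976457, 33871648, 20851820], index=index)

# Boolean mask: keep the rows whose value exceeds the threshold
large = pop[pop > 22000000]

# Fancy indexing: a list of level-0 labels selects all their rows
two_states = pop.loc[['California', 'Texas']]

print(large)
print(two_states)
```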

2. MultiIndex on a DataFrame

Using the health_data data set as the example

health_data
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      30.0  38.4  43.0  35.8  34.0  36.0
     2      35.0  36.0  31.0  36.6  24.0  36.7
2014 1      31.0  36.6  47.0  37.2  37.0  39.0
     2      39.0  35.7  40.0  36.4  37.0  36.9

  • Plain indexing on a DataFrame is column indexing; without loc or iloc only columns can be selected
health_data['Guido','HR']
year  visit
2013  1        43.0
      2        31.0
2014  1        47.0
      2        40.0
Name: (Guido, HR), dtype: float64
  • The DataFrame indexers select on rows and columns together
health_data.iloc[0:2, 0:2]
subject      Bob
type          HR  Temp
year visit
2013 1      30.0  38.4
     2      35.0  36.0

health_data.loc[:,(('Bob','Guido'), 'HR')]  # worth studying this example closely
subject      Bob  Guido
type          HR     HR
year visit
2013 1      30.0   43.0
     2      35.0   31.0
2014 1      31.0   47.0
     2      39.0   40.0

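The loc indexer handles not only row and column selection but also multi-level selection on either axis; the example above shows this well.

  • health_data.loc[ , ]: rows go to the left of the comma, columns to the right
  • health_data.loc[: ,(
    (level-0 column labels),(level-1 column labels) )
    ] selecting on several levels requires a tuple with one entry per level, nested as shown

Tuples are not a very convenient way to index: writing a slice inside a tuple is a syntax error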

health_data.loc[:,(:, 'HR')]  
  File "", line 1
    health_data.loc[:,(:, 'HR')]
                       ^
SyntaxError: invalid syntax
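Two ways around this syntax error are an explicit slice(None) or the pd.IndexSlice helper. A sketch, assuming the fixed values printed in the array earlier (the original data is random); the result names are mine:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Fixed values copied from the array printed earlier
data = np.array([[30., 38.4, 43., 35.8, 34., 36.],
                 [35., 36., 31., 36.6, 24., 36.7],
                 [31., 36.6, 47., 37.2, 37., 39.],
                 [39., 35.7, 40., 36.4, 37., 36.9]])
health_data = pd.DataFrame(data, index=index, columns=columns)

# slice(None) is what a bare ":" means inside square brackets
hr_by_slice_obj = health_data.loc[:, (slice(None), 'HR')]

# pd.IndexSlice lets the ":" syntax be written directly
idx = pd.IndexSlice
hr_by_indexslice = health_data.loc[:, idx[:, 'HR']]

print(hr_by_indexslice)
```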

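3. Setting and resetting the index

Setting and resetting the index moves labels between the index and ordinary columns, which switches a data set between its hierarchical form and a flat table

  • Resetting the index
    Here the operation is applied to a Series object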
help(pop.reset_index)
Help on method reset_index in module pandas.core.series:

reset_index(level=None, drop=False, name=None, inplace=False) method of pandas.core.series.Series instance
    Generate a new DataFrame or Series with the index reset.
    
    This is useful when the index needs to be treated as a column, or
    when the index is meaningless and needs to be reset to the default
    before another operation.
    
    Parameters
    ----------
    level : int, str, tuple, or list, default optional
        For a Series with a MultiIndex, only remove the specified levels
        from the index. Removes all levels by default.
    drop : bool, default False
        Just reset the index, without inserting it as a column in
        the new DataFrame.
    name : object, optional
        The name to use for the column containing the original Series
        values. Uses ``self.name`` by default. This argument is ignored
        when `drop` is True.
    inplace : bool, default False
        Modify the Series in place (do not create a new object).
    
    Returns
    -------
    Series or DataFrame
        When `drop` is False (the default), a DataFrame is returned.
        The newly created columns will come first in the DataFrame,
        followed by the original Series values.
        When `drop` is True, a `Series` is returned.
        In either case, if ``inplace=True``, no value is returned.
    
    See Also
    --------
    DataFrame.reset_index: Analogous function for DataFrame.
    
    Examples
    --------
    
    >>> s = pd.Series([1, 2, 3, 4], name='foo',
    ...               index=pd.Index(['a', 'b', 'c', 'd'], name='idx'))
    
    Generate a DataFrame with default index.
    
    >>> s.reset_index()
      idx  foo
    0   a    1
    1   b    2
    2   c    3
    3   d    4
    
    To specify the name of the new column use `name`.
    
    >>> s.reset_index(name='values')
      idx  values
    0   a       1
    1   b       2
    2   c       3
    3   d       4
    
    To generate a new Series with the default set `drop` to True.
    
    >>> s.reset_index(drop=True)
    0    1
    1    2
    2    3
    3    4
    Name: foo, dtype: int64
    
    To update the Series in place, without generating a new one
    set `inplace` to True. Note that it also requires ``drop=True``.
    
    >>> s.reset_index(inplace=True, drop=True)
    >>> s
    0    1
    1    2
    2    3
    3    4
    Name: foo, dtype: int64
    
    The `level` parameter is interesting for Series with a multi-level
    index.
    
    >>> arrays = [np.array(['bar', 'bar', 'baz', 'baz']),
    ...           np.array(['one', 'two', 'one', 'two'])]
    >>> s2 = pd.Series(
    ...     range(4), name='foo',
    ...     index=pd.MultiIndex.from_arrays(arrays,
    ...                                     names=['a', 'b']))
    
    To remove a specific level from the Index, use `level`.
    
    >>> s2.reset_index(level='a')
           a  foo
    b
    one  bar    0
    two  bar    1
    one  baz    2
    two  baz    3
    
    If `level` is not set, all levels are removed from the Index.
    
    >>> s2.reset_index()
         a    b  foo
    0  bar  one    0
    1  bar  two    1
    2  baz  one    2
    3  baz  two    3
pop_flat = pop.reset_index()  # without name, the value column gets a default label (0 here)
pop_flat
       states  years         0
0  California   2000  25145561
1  California   2010  19378102
2    New York   2000  37253956
3    New York   2010  18976457
4       Texas   2000  33871648
5       Texas   2010  20851820

pop_flat2 = pop.reset_index(name = 'population')  # name sets the label of the value column
pop_flat2
       states  years  population
0  California   2000    25145561
1  California   2010    19378102
2    New York   2000    37253956
3    New York   2010    18976457
4       Texas   2000    33871648
5       Texas   2010    20851820

  • Setting the index
    This is the reverse of the operation above
    Taking pop_flat2 as the example, it turns the plain DataFrame above into a DataFrame with a MultiIndex
pop_flat2.set_index(['states', 'years'])  # returns a DataFrame
                  population
states     years
California 2000     25145561
           2010     19378102
New York   2000     37253956
           2010     18976457
Texas      2000     33871648
           2010     20851820

pop_flat2.set_index( 'years')  # returns a DataFrame
           states  population
years
2000   California    25145561
2010   California    19378102
2000     New York    37253956
2010     New York    18976457
2000        Texas    33871648
2010        Texas    20851820
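That reset_index and set_index undo each other can be checked with a round trip. A sketch, assuming the pop values printed earlier; flat and restored are names I picked:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
# Values copied from the output shown earlier
pop = pd.Series([25145561, 19378102, 37253956,
                 18976457, 33871648, 20851820], index=index)

# Index levels become ordinary columns
flat = pop.reset_index(name='population')

# Columns go back into the index; selecting the one column gives a Series again
restored = flat.set_index(['states', 'years'])['population']

print(restored)
```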

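stack and unstack

Using the pop data set as the example
I find stack and unstack very handy for changing dimensionality: they reshape a data set between long and wide form to suit different needs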

pop
states      years
California  2000     25145561
            2010     19378102
New York    2000     37253956
            2010     18976457
Texas       2000     33871648
            2010     20851820
dtype: int64

Use unstack to turn years (level 1) into the column labels

pop.unstack(level = 1)
years           2000      2010
states
California  25145561  19378102
New York    37253956  18976457
Texas       33871648  20851820
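The level argument decides which index level moves to the columns, and stack reverses the move. A sketch, assuming the pop values printed above; wide, wide_states and back are illustrative names, and recent pandas releases may print a FutureWarning about the stack implementation:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['states', 'years'])
# Values copied from the output shown above
pop = pd.Series([25145561, 19378102, 37253956,
                 18976457, 33871648, 20851820], index=index)

wide = pop.unstack(level=1)         # years become the columns
wide_states = pop.unstack(level=0)  # states become the columns

# stack moves the columns back into the innermost index level
back = wide.stack()

print(wide_states)
```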
