Python 数据处理（二十七）—

20 索引对象

索引几乎都是不可变的，但是可以设置和更改它们的 name 属性。您可以使用 rename、set_names 直接设置这些属性，它们默认会返回一个拷贝

In [315]: ind = pd.Index([1, 2, 3])

In [316]: ind.rename("apple")
Out[316]: Int64Index([1, 2, 3], dtype='int64', name='apple')

In [317]: ind
Out[317]: Int64Index([1, 2, 3], dtype='int64')

In [318]: ind.set_names(["apple"], inplace=True)

In [319]: ind.name = "bob"

In [320]: ind
Out[320]: Int64Index([1, 2, 3], dtype='int64', name='bob')

set_names、set_levels 和 set_codes 还有一个可选的 level 参数

In [321]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])

In [322]: index
Out[322]: 
MultiIndex([(0, 'one'),
            (0, 'two'),
            (1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['first', 'second'])

In [323]: index.levels[1]
Out[323]: Index(['one', 'two'], dtype='object', name='second')

In [324]: index.set_levels(["a", "b"], level=1)
Out[324]: 
MultiIndex([(0, 'a'),
            (0, 'b'),
            (1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['first', 'second'])

20.1 索引对象的集合操作

索引对象的集合操作包括 union、intersection 和 difference

例如，差集

In [325]: a = pd.Index(['c', 'b', 'a'])

In [326]: b = pd.Index(['c', 'e', 'd'])

In [327]: a.difference(b)
Out[327]: Index(['a', 'b'], dtype='object')

并集和交集

>>> a.union(b)
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

>>> a.intersection(b)
Index(['c'], dtype='object')

还可以使用 symmetric_difference 操作，它返回出现在 idx1 或 idx2 中的元素，但不同时出现在两个索引中的元素。这相当于

idx1.difference(idx2).union(idx2.difference(idx1))

In [328]: idx1 = pd.Index([1, 2, 3, 4])

In [329]: idx2 = pd.Index([2, 3, 4, 5])

In [330]: idx1.symmetric_difference(idx2)
Out[330]: Int64Index([1, 5], dtype='int64')

注意：集合操作产生的索引将按升序排序

当在具有不同 dtype 的索引之间执行 Index.union() 时，索引必须能转换为一个公共 dtype。通常是 object dtype

比如，整数索引和浮点索引取并集

In [331]: idx1 = pd.Index([0, 1, 2])

In [332]: idx2 = pd.Index([0.5, 1.5])

In [333]: idx1.union(idx2)
Out[333]: Float64Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')

20.2 缺失值

提示：虽然索引也支持缺失值（NaN），但是最好不要使用，因为有些操作默认会忽略缺失值

可以使用 Index.fillna 用于指定的标量值来填充缺失值

In [334]: idx1 = pd.Index([1, np.nan, 3, 4])

In [335]: idx1
Out[335]: Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')

In [336]: idx1.fillna(2)
Out[336]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

In [337]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
   .....:                          pd.NaT,
   .....:                          pd.Timestamp('2011-01-03')])
   .....: 

In [338]: idx2
Out[338]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)

In [339]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[339]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)

21 设置和重置索引

有时，你会从数据集中加载或创建 DataFrame，并希望添加一些索引。

一般有以下几种方法

21.1 设置索引

DataFrame 有一个 set_index() 方法，它接受一个列名或列名列表，来创建一个新的索引或重新创建索引

In [340]: data
Out[340]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0

In [341]: indexed1 = data.set_index('c')

In [342]: indexed1
Out[342]: 
     a    b    d
c               
z  bar  one  1.0
y  bar  two  2.0
x  foo  one  3.0
w  foo  two  4.0

In [343]: indexed2 = data.set_index(['a', 'b'])

In [344]: indexed2
Out[344]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

append 关键字参数允许你保留现有的索引，并将给定的列追加到索引

In [345]: frame = data.set_index('c', drop=False)

In [346]: frame = frame.set_index(['a', 'b'], append=True)

In [347]: frame
Out[347]: 
           c    d
c a   b          
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

set_index 还有其他参数，允许你不删除索引列或原地添加索引(不创建新的对象)

In [348]: data.set_index('c', drop=False)
Out[348]: 
     a    b  c    d
c                  
z  bar  one  z  1.0
y  bar  two  y  2.0
x  foo  one  x  3.0
w  foo  two  w  4.0

In [349]: data.set_index(['a', 'b'], inplace=True)

In [350]: data
Out[350]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

21.2 重置索引

为了方便起见，DataFrame 上有一个名为 reset_index() 的函数，它将索引值添加到 DataFrame 的列中，并设置一个简单的整数索引。相当于 set_index() 的逆操作

In [351]: data
Out[351]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

In [352]: data.reset_index()
Out[352]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0

新列的列名为索引的 name 属性，你可以使用 level 关键字删除部分索引

In [353]: frame
Out[353]: 
           c    d
c a   b          
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

In [354]: frame.reset_index(level=1)
Out[354]: 
         a  c    d
c b               
z one  bar  z  1.0
y two  bar  y  2.0
x one  foo  x  3.0
w two  foo  w  4.0

reset_index 接受一个可选参数 drop，如果为真，则直接丢弃索引，而不是将索引值放入 DataFrame 的列中

为索引赋值

data.index = index

22 返回视图或拷贝

在设置 pandas 对象中的值时，必须小心避免所谓的链接索引。

对于下面的一个例子

In [355]: dfmi = pd.DataFrame([list('abcd'),
   .....:                      list('efgh'),
   .....:                      list('ijkl'),
   .....:                      list('mnop')],
   .....:                     columns=pd.MultiIndex.from_product([['one', 'two'],
   .....:                                                         ['first', 'second']]))
   .....: 

In [356]: dfmi
Out[356]: 
    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p

比较这两种访问方式

In [357]: dfmi['one']['second']
Out[357]: 
0    b
1    f
2    j
3    n
Name: second, dtype: object

In [358]: dfmi.loc[:, ('one', 'second')]
Out[358]: 
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

这两者产生相同的结果，那么您应该使用哪一种呢？应该如何理解这些操作的顺序呢？

对于第一种方法，dfmi['one'] 先获取第一个 level 的数据，然后 ['second'] 是对前一步的结果进一步取值的。调用了两次获取属性的方法 __getitem__ ，这些步骤是分开的，像是链条一样一个接着一个

而 df.loc[:,('one','second')] 会解析为嵌套元组 (slice(None),('one','second')) 并调用一次 __getitem__，这种操作的速度会更快。

22.1 执行顺序

使用链式索引时，索引操作的类型和顺序部分决定了返回的结果是原始对象的切片还是切片的拷贝。

pandas 之所以有 SettingWithCopyWarning，是因为通常为一个切片的拷贝赋值是无意识的，而是链式索引返回一个切片的拷贝所造成的错误

如果你想让 pandas 或多或少地支持对链式索引表达式的赋值，可以将 mode.chained_assignment 设置为下面的值

'warn': 默认值，打印出 SettingWithCopyWarning 信息
'raise': 引发 SettingWithCopyException 异常
None: 完全消除警告

In [359]: dfb = pd.DataFrame({'a': ['one', 'one', 'two',
   .....:                           'three', 'two', 'one', 'six'],
   .....:                     'c': np.arange(7)})
   .....: 

# 将会显示 SettingWithCopyWarning
# 但是设置值的操作还是会生效
In [360]: dfb['c'][dfb['a'].str.startswith('o')] = 42

但是在切片拷贝上操作不会生效

>>> pd.set_option('mode.chained_assignment','warn')
>>> dfb[dfb['a'].str.startswith('o')]['c'] = 42
Traceback (most recent call last)
     ...
SettingWithCopyWarning:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead

注意：这些设置规则适用于所有 .loc/.iloc

以下是推荐使用的访问方法，在 .loc 中使用固定索引来访问单个或多个数据

In [361]: dfc = pd.DataFrame({'a': ['one', 'one', 'two',
   .....:                           'three', 'two', 'one', 'six'],
   .....:                     'c': np.arange(7)})
   .....: 

In [362]: dfd = dfc.copy()

# 设置多个值
In [363]: mask = dfd['a'].str.startswith('o')

In [364]: dfd.loc[mask, 'c'] = 42

In [365]: dfd
Out[365]: 
       a   c
0    one  42
1    one  42
2    two   2
3  three   3
4    two   4
5    one  42
6    six   6

# 设置单个值
In [366]: dfd = dfc.copy()

In [367]: dfd.loc[2, 'a'] = 11

In [368]: dfd
Out[368]: 
       a  c
0    one  0
1    one  1
2     11  2
3  three  3
4    two  4
5    one  5
6    six  6

下面的方法只是偶尔发挥作用，所以应该避免使用

In [369]: dfd = dfc.copy()

In [370]: dfd['a'][2] = 111

In [371]: dfd
Out[371]: 
       a  c
0    one  0
1    one  1
2    111  2
3  three  3
4    two  4
5    one  5
6    six  6

最后的例子根本不起作用，所以应该避免

>>> pd.set_option('mode.chained_assignment','raise')
>>> dfd.loc[0]['a'] = 1111
Traceback (most recent call last)
     ...
SettingWithCopyException:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead

Python 数据处理（二十七）—— index 对象