Python Data Processing (25): Indexing and Selecting Data (continued)

13 Indexing with isin

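The isin() method, as its name suggests, tests whether each element of a pandas object is contained in the values passed in (a Series, DataFrame, dict, or other iterable). It returns a boolean object with the same shape as the caller: a boolean Series for a Series, a boolean DataFrame for a DataFrame.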

In [165]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

In [166]: s
Out[166]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [167]: s.isin([2, 4, 6])
Out[167]: 
4    False
3    False
2     True
1    False
0     True
dtype: bool

In [168]: s[s.isin([2, 4, 6])]
Out[168]: 
2    2
0    4
dtype: int64

The same method is available on Index objects. It is useful when you do not know which of the labels you are looking for actually exist.

In [169]: s[s.index.isin([2, 4, 6])]
Out[169]: 
4    0
2    2
dtype: int64

# Note the difference from reindex, which keeps missing labels and fills them with NaN
In [170]: s.reindex([2, 4, 6])
Out[170]: 
2    2.0
4    0.0
6    NaN
dtype: float64
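
The contrast between the two approaches can be sketched in a standalone script. This is a minimal illustration that rebuilds the same Series; the variable names `wanted`, `filtered`, and `reindexed` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")
wanted = [2, 4, 6]  # label 6 does not exist in s

# Boolean mask on the index: keeps only labels that exist,
# in the order they appear in the Series, and preserves the dtype.
filtered = s[s.index.isin(wanted)]

# reindex: follows the order of `wanted` and inserts NaN for the
# missing label, which upcasts the result to float64.
reindexed = s.reindex(wanted)

print(filtered)
print(reindexed)
```

The mask-based form is the safer choice when missing labels should simply be ignored rather than surfaced as NaN.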

For a MultiIndex, you can also choose the level against which membership is checked

In [171]: s_mi = pd.Series(np.arange(6),
   .....:                  index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
   .....: 

In [172]: s_mi
Out[172]: 
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int64

In [173]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[173]: 
0  c    2
1  a    3
dtype: int64

In [174]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[174]: 
0  a    0
   c    2
1  a    3
   c    5
dtype: int64
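
As a rough sketch of what the two calls compute, the masks can be inspected directly. The names `full_key_mask` and `level_mask` are illustrative; the comparison with `get_level_values` is one assumed way to cross-check the `level=` form.

```python
import numpy as np
import pandas as pd

s_mi = pd.Series(
    np.arange(6),
    index=pd.MultiIndex.from_product([[0, 1], ["a", "b", "c"]]),
)

# Without `level`, each value is compared against the full index tuple.
full_key_mask = s_mi.index.isin([(1, "a"), (2, "b"), (0, "c")])

# With `level=1`, only the labels of the second level are compared.
level_mask = s_mi.index.isin(["a", "c", "e"], level=1)

# The level form should agree with testing that level's values directly.
same_as_level_values = s_mi.index.get_level_values(1).isin(["a", "c", "e"])

print(full_key_mask)
print(level_mask)
```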

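DataFrame has an isin() method as well. Given an array, list, or other sequence of values, it returns a boolean DataFrame with the same shape as the original, with True wherever the element is among the values.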

In [175]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
   .....:                    'ids2': ['a', 'n', 'c', 'n']})
   .....: 

In [176]: values = ['a', 'b', 1, 3]

In [177]: df.isin(values)
Out[177]: 
    vals    ids   ids2
0   True   True   True
1  False   True  False
2   True  False  False
3  False  False  False

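Often you want to match particular values against particular columns. To do that, pass values as a dict whose keys are column names and whose values are lists of the items to check for.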

In [178]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [179]: df.isin(values)
Out[179]: 
    vals    ids   ids2
0   True   True  False
1  False   True  False
2   True  False  False
3  False  False  False

Combine DataFrame's isin with the any() and all() methods to quickly select the subset of data that meets a given criterion.

In [180]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [181]: row_mask = df.isin(values).all(1)

In [182]: df[row_mask]
Out[182]: 
   vals ids ids2
0     1   a    a
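
To make the difference between all() and any() concrete, here is a small sketch on the same data. The names `matches`, `all_rows`, and `any_rows` are illustrative, and `axis=1` is spelled out as a keyword.

```python
import pandas as pd

df = pd.DataFrame({
    "vals": [1, 2, 3, 4],
    "ids": ["a", "b", "f", "n"],
    "ids2": ["a", "n", "c", "n"],
})
values = {"ids": ["a", "b"], "ids2": ["a", "c"], "vals": [1, 3]}

matches = df.isin(values)  # boolean DataFrame, one check per column

# Rows where every column matched its own list of values.
all_rows = df[matches.all(axis=1)]

# Rows where at least one column matched.
any_rows = df[matches.any(axis=1)]

print(all_rows)
print(any_rows)
```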

14 The `where()` and `mask()` methods

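Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that the result has the same shape as the original data, use the where method of Series and DataFrame.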

Returning only the selected rows

In [183]: s[s > 0]
Out[183]: 
3    1
2    2
1    3
0    4
dtype: int64

Returning a Series with the same shape as the original

In [184]: s.where(s > 0)
Out[184]: 
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

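Selecting from a DataFrame with a boolean DataFrame already preserves the shape of the input, because where is used under the hood. The code below is equivalent to df.where(df < 0). Here df is a DataFrame of floats with a date index and columns A to D, not the df defined in section 13.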

In [185]: df[df < 0]
Out[185]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838

In addition, where takes an optional other argument that supplies the replacement for values where the condition is False

In [186]: df.where(df < 0, -df)
Out[186]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838
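
Because the DataFrame above holds random numbers, the behavior is easier to verify on fixed data. The following sketch uses a small hand-written frame (an assumption for illustration, not the data from the transcript).

```python
import pandas as pd

df = pd.DataFrame({"A": [-1.0, 2.0, -3.0], "B": [4.0, -5.0, 6.0]})

# Without `other`, entries that fail the condition become NaN.
kept = df.where(df < 0)

# Boolean-DataFrame indexing gives the same shape-preserving result.
same = df[df < 0]

# With `other`, failing entries are replaced element-wise; here every
# non-negative value is swapped for its negation.
flipped = df.where(df < 0, -df)

print(kept)
print(flipped)
```

In this example `flipped` ends up equal to `-df.abs()`, since negative values are kept and non-negative values are negated.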

You may want to set values based on some boolean criterion, which can be done directly with boolean indexing

In [187]: s2 = s.copy()

In [188]: s2[s2 < 0] = 0

In [189]: s2
Out[189]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [190]: df2 = df.copy()

In [191]: df2[df2 < 0] = 0

In [192]: df2
Out[192]: 
                   A         B         C         D
2000-01-01  0.000000  0.000000  0.485855  0.245166
2000-01-02  0.000000  0.390389  0.000000  1.655824
2000-01-03  0.000000  0.299674  0.000000  0.281059
2000-01-04  0.846958  0.000000  0.600705  0.000000
2000-01-05  0.669692  0.000000  0.000000  0.342416
2000-01-06  0.868584  0.000000  2.297780  0.000000
2000-01-07  0.000000  0.000000  0.168904  0.000000
2000-01-08  0.801196  1.392071  0.000000  0.000000

By default, where returns a modified copy of the data. The optional inplace parameter modifies the original data without creating a copy.

In [193]: df_orig = df.copy()

In [194]: df_orig.where(df > 0, -df, inplace=True)

In [195]: df_orig
Out[195]: 
                   A         B         C         D
2000-01-01  2.104139  1.309525  0.485855  0.245166
2000-01-02  0.352480  0.390389  1.192319  1.655824
2000-01-03  0.864883  0.299674  0.227870  0.281059
2000-01-04  0.846958  1.222082  0.600705  1.233203
2000-01-05  0.669692  0.605656  1.169184  0.342416
2000-01-06  0.868584  0.948458  2.297780  0.684718
2000-01-07  2.670153  0.114722  0.168904  0.048048
2000-01-08  0.801196  1.392071  0.048788  0.808838

Note:

The signature of DataFrame.where() differs from that of numpy.where(): df1.where(m, df2) is roughly equivalent to np.where(m, df1, df2).

In [196]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[196]: 
               A     B     C     D
2000-01-01  True  True  True  True
2000-01-02  True  True  True  True
2000-01-03  True  True  True  True
2000-01-04  True  True  True  True
2000-01-05  True  True  True  True
2000-01-06  True  True  True  True
2000-01-07  True  True  True  True
2000-01-08  True  True  True  True
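
One practical difference is worth a quick sketch: the pandas method returns a labeled DataFrame, while numpy.where returns a plain array. The frame below is hand-written for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [-1.0, 2.0], "B": [3.0, -4.0]}, index=["x", "y"])
m = df1 < 0

# pandas: condition first, replacement second; labels are preserved.
via_pandas = df1.where(m, -df1)

# numpy: condition, value-if-True, value-if-False; returns an ndarray.
via_numpy = np.where(m, df1, -df1)

print(type(via_pandas).__name__, type(via_numpy).__name__)
print(np.array_equal(via_pandas.to_numpy(), via_numpy))
```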

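Alignment

Furthermore, where aligns the boolean condition with the data, so partial selection with setting is possible. This is analogous to partial setting via .loc, but it acts on the contents rather than the axis labels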

In [197]: df2 = df.copy()

In [198]: df2[df2[1:4] > 0] = 3

In [199]: df2
Out[199]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525  0.485855  0.245166
2000-01-02 -0.352480  3.000000 -1.192319  3.000000
2000-01-03 -0.864883  3.000000 -0.227870  3.000000
2000-01-04  3.000000 -1.222082  3.000000 -1.233203
2000-01-05  0.669692 -0.605656 -1.169184  0.342416
2000-01-06  0.868584 -0.948458  2.297780 -0.684718
2000-01-07 -2.670153 -0.114722  0.168904 -0.048048
2000-01-08  0.801196  1.392071 -0.048788 -0.808838

where also accepts axis and level parameters to align the input

In [200]: df2 = df.copy()

In [201]: df2.where(df2 > 0, df2['A'], axis='index')
Out[201]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196

This is equivalent to, but faster than, the following code

In [202]: df2 = df.copy()

In [203]: df2.apply(lambda x, y: x.where(x > 0, y), y=df2['A'])
Out[203]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196
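
The equivalence can be checked on fixed data. This sketch uses a small hand-written frame, and the keyword name `fill` is an arbitrary choice for the extra argument passed through apply.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [-1.0, 2.0, -3.0],
    "B": [4.0, -5.0, 6.0],
    "C": [-7.0, 8.0, 9.0],
})

# Vectorized form: the Series df["A"] is aligned along the row index,
# so each failing entry is replaced by the value of A in the same row.
fast = df.where(df > 0, df["A"], axis="index")

# Column-by-column form: apply calls the lambda once per column.
slow = df.apply(lambda col, fill: col.where(col > 0, fill), fill=df["A"])

print(fast)
print(fast.equals(slow))
```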

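The condition and other arguments of where can each be a callable. The callable must take one argument (the calling Series or DataFrame) and return a valid condition or other value.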

In [204]: df3 = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [4, 5, 6],
   .....:                     'C': [7, 8, 9]})
   .....: 

In [205]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[205]: 
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9
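
Callables are mainly useful in method chains, where the intermediate DataFrame has no name to refer to. A brief sketch, in which the extra column `D` and the replacement value 0 are assumptions chosen for illustration:

```python
import pandas as pd

df3 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# The callable form and the explicit form produce the same result.
with_callables = df3.where(lambda x: x > 4, lambda x: x + 10)
explicit = df3.where(df3 > 4, df3 + 10)

# In a chain, the lambda receives the frame produced by assign(),
# including the new column D.
chained = df3.assign(D=lambda d: d["A"] * 2).where(lambda d: d > 4, 0)

print(with_callables)
print(chained)
```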

mask

mask() is the inverse boolean operation of where: it replaces values where the condition is True

In [206]: s.mask(s >= 0)
Out[206]: 
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64

In [207]: df.mask(df >= 0)
Out[207]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
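
The relationship between the two methods can be sketched on a Series that contains both signs (hand-written here, since the Series s above has no negative values).

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])

# mask hides the entries where the condition is True ...
masked = s.mask(s >= 0)

# ... which matches where with the condition inverted.
via_where = s.where(~(s >= 0))

# Like where, mask accepts a replacement value as its second argument.
clipped = s.mask(s < 0, 0)

print(masked)
print(clipped)
```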

15 Setting with enlargement conditionally using numpy

An alternative to where() is numpy.where(). Combined with assigning a new column, it lets you enlarge a DataFrame with values determined by a condition.

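Suppose you have two values to choose from in the following DataFrame, and you want a new column color set to 'green' where the second column is 'Z' and to 'red' otherwise. You can do the following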

In [208]: df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})

In [209]: df['color'] = np.where(df['col2'] == 'Z', 'green', 'red')

In [210]: df
Out[210]: 
  col1 col2  color
0    A    Z  green
1    B    Z  green
2    B    X    red
3    C    Y    red

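If you have multiple conditions, use numpy.select().

For example, with three conditions mapped to three colors and a fourth color as the fallback, you can do the following. The conditions are checked in order, and the first one that is True determines the choice.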

In [211]: conditions = [
   .....:     (df['col2'] == 'Z') & (df['col1'] == 'A'),
   .....:     (df['col2'] == 'Z') & (df['col1'] == 'B'),
   .....:     (df['col1'] == 'B')
   .....: ]
   .....: 

In [212]: choices = ['yellow', 'blue', 'purple']

In [213]: df['color'] = np.select(conditions, choices, default='black')

In [214]: df
Out[214]: 
  col1 col2   color
0    A    Z  yellow
1    B    Z    blue
2    B    X  purple
3    C    Y   black
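
Since the first matching condition wins, the order of the list matters. The sketch below repeats the example and then reorders two of the conditions to show the effect; the name `reordered` and the shortened condition list are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": list("ABBC"), "col2": list("ZZXY")})

conditions = [
    (df["col2"] == "Z") & (df["col1"] == "A"),
    (df["col2"] == "Z") & (df["col1"] == "B"),
    (df["col1"] == "B"),
]
choices = ["yellow", "blue", "purple"]
color = np.select(conditions, choices, default="black")

# Putting the broad condition first shadows the narrower one:
# every row with col1 == 'B' is now 'purple', and 'blue' is never used.
reordered = np.select(
    [df["col1"] == "B", (df["col2"] == "Z") & (df["col1"] == "B")],
    ["purple", "blue"],
    default="black",
)

print(color.tolist())
print(reordered.tolist())
```

Listing the most specific conditions first is a simple way to avoid this kind of shadowing.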

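16 Duplicate data

To identify and remove duplicate rows in a DataFrame, two methods are available:

duplicated: returns a boolean vector whose length is the number of rows, indicating whether each row is a duplicate
drop_duplicates: removes the duplicate rows

By default, the first observed row of a duplicate set is treated as unique (that is, the first occurrence is kept), but each method has a keep parameter to choose which rows are kept

keep='first' (default): mark or drop duplicates except for the first occurrence
keep='last': mark or drop duplicates except for the last occurrence
keep=False: mark or drop all duplicates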

In [281]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

In [282]: df2
Out[282]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [283]: df2.duplicated('a')
Out[283]: 
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [284]: df2.duplicated('a', keep='last')
Out[284]: 
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [285]: df2.duplicated('a', keep=False)
Out[285]: 
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [286]: df2.drop_duplicates('a')
Out[286]: 
       a  b         c
0    one  x -1.067137
2    two  x -0.211056
5  three  x -1.964475
6   four  x  1.298329

In [287]: df2.drop_duplicates('a', keep='last')
Out[287]: 
       a  b         c
1    one  y  0.309500
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [288]: df2.drop_duplicates('a', keep=False)
Out[288]: 
       a  b         c
5  three  x -1.964475
6   four  x  1.298329

You can also pass a list of column names to identify duplicate rows

In [289]: df2.duplicated(['a', 'b'])
Out[289]: 
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

In [290]: df2.drop_duplicates(['a', 'b'])
Out[290]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
5  three  x -1.964475
6   four  x  1.298329
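
The two methods are closely related: dropping duplicates gives the same rows as filtering with the negated duplicated() mask. A minimal sketch, using fixed placeholder values in column c instead of random numbers:

```python
import pandas as pd

df2 = pd.DataFrame({
    "a": ["one", "one", "two", "two", "two", "three", "four"],
    "b": ["x", "y", "x", "y", "x", "x", "x"],
    "c": range(7),  # placeholder values, not the random data shown above
})

kept_index = []
for keep in ("first", "last", False):
    dropped = df2.drop_duplicates("a", keep=keep)
    masked = df2[~df2.duplicated("a", keep=keep)]
    # Both routes should select exactly the same rows.
    assert dropped.equals(masked)
    kept_index.append(dropped.index.tolist())

# Using two columns as the key leaves only one duplicate row (index 4).
by_two_columns = df2.drop_duplicates(["a", "b"])

print(kept_index)
print(by_two_columns.index.tolist())
```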

To drop duplicates by index value, use Index.duplicated and then slice with the resulting mask. It accepts the same keep parameter

In [291]: df3 = pd.DataFrame({'a': np.arange(6),
   .....:                     'b': np.random.randn(6)},
   .....:                    index=['a', 'a', 'b', 'c', 'b', 'a'])
   .....: 

In [292]: df3
Out[292]: 
   a         b
a  0  1.440455
a  1  2.456086
b  2  1.038402
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [293]: df3.index.duplicated()
Out[293]: array([False,  True, False, False,  True,  True])

In [294]: df3[~df3.index.duplicated()]
Out[294]: 
   a         b
a  0  1.440455
b  2  1.038402
c  3 -0.894409

In [295]: df3[~df3.index.duplicated(keep='last')]
Out[295]: 
   a         b
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [296]: df3[~df3.index.duplicated(keep=False)]
Out[296]: 
   a         b
c  3 -0.894409
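
A quick way to confirm that the filtering worked is to check the index's is_unique attribute before and after. This sketch keeps only the integer column, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({"a": np.arange(6)},
                   index=["a", "a", "b", "c", "b", "a"])

first = df3[~df3.index.duplicated()]                  # keep='first'
last = df3[~df3.index.duplicated(keep="last")]        # keep the last occurrence
unique_only = df3[~df3.index.duplicated(keep=False)]  # labels seen exactly once

print(df3.index.is_unique, first.index.is_unique)
print(first["a"].tolist(), last["a"].tolist(), unique_only["a"].tolist())
```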

17 Dictionary-like get() method

Every Series and DataFrame has a get method that can return a default value when the key is missing

In [297]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [298]: s.get('a')  # equivalent to s['a']
Out[298]: 1

In [299]: s.get('x', default=-1)
Out[299]: -1
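
A short sketch of how get differs from bracket indexing when the key is missing. The DataFrame `frame` and its column names are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

present = s.get("a")
missing_default = s.get("x", default=-1)
missing_none = s.get("x")  # no default given, so None is returned

# Bracket indexing raises instead of returning a default.
try:
    s["x"]
    raised = False
except KeyError:
    raised = True

# On a DataFrame, get looks up a column.
frame = pd.DataFrame({"A": [1, 2]})
col = frame.get("A")
no_col = frame.get("B")

print(present, missing_default, missing_none, raised, no_col)
```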

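18 Looking up values by index or column labels

Sometimes you want to extract a set of values given a sequence of row labels and column labels. This can be done with DataFrame.melt combined with DataFrame.loc. In the example below, the col column names, for each row, the column whose value should be extracted.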

In [300]: df = pd.DataFrame({'col': ["A", "A", "B", "B"],
   .....:                    'A': [80, 23, np.nan, 22],
   .....:                    'B': [80, 55, 76, 67]})
   .....: 

In [301]: df
Out[301]: 
  col     A   B
0   A  80.0  80
1   A  23.0  55
2   B   NaN  76
3   B  22.0  67

In [302]: melt = df.melt('col')

In [303]: melt = melt.loc[melt['col'] == melt['variable'], 'value']

In [304]: melt.reset_index(drop=True)
Out[304]: 
0    80.0
1    23.0
2    76.0
3    67.0
Name: value, dtype: float64
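
As a cross-check, the same lookup can be sketched with pandas.factorize and NumPy integer indexing. This is an alternative technique, not the method used above, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col": ["A", "A", "B", "B"],
                   "A": [80, 23, np.nan, 22],
                   "B": [80, 55, 76, 67]})

# melt + loc, as in the transcript above.
melted = df.melt("col")
via_melt = (melted.loc[melted["col"] == melted["variable"], "value"]
            .reset_index(drop=True))

# factorize turns the labels in `col` into integer column positions.
codes, uniques = pd.factorize(df["col"])
# Order the value columns like `uniques`, then pick one cell per row.
via_numpy = df.reindex(uniques, axis=1).to_numpy()[np.arange(len(df)), codes]

print(via_melt.tolist())
print(via_numpy.tolist())
```

The factorize variant picks one cell per row in the original row order, whereas the melt variant returns values grouped by column name, so the two agree here because the rows are already grouped by col.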