7.1 处理缺失值

目录

    • 7.1 处理缺失值
      • 7.1.1 过滤缺失值
      • 7.1.2 补全缺失值

7.1 处理缺失值

isnull()

In [10]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [11]: string_data
Out[11]: 
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
In [12]: string_data.isnull()
Out[12]: 
0    False
1    False
2     True
3    False
dtype: bool
In [13]: string_data[0] = None

In [14]: string_data.isnull()
Out[14]: 
0     True
1    False
2     True
3    False
dtype: bool

7.1.1 过滤缺失值

  1. dropna()
In [15]: from numpy import nan as NA
In [16]: data = pd.Series([1, NA, 3.5, NA, 7])

In [17]: data.dropna()
Out[17]: 
0    1.0
2    3.5
4    7.0
dtype: float64
In [18]: data[data.notnull()]
Out[18]: 
0    1.0
2    3.5
4    7.0
dtype: float64
  1. DataFramedropna()默认删除含NA
In [19]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....:                      [NA, NA, NA], [NA, 6.5, 3.]])

In [20]: cleaned = data.dropna()
In [21]: data
Out[21]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
In [22]: cleaned
Out[22]: 
     0    1    2
0  1.0  6.5  3.0
  1. how='all'
In [23]: data.dropna(how='all')
Out[23]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
  1. axis=1,表示
In [24]: data[4] = NA

In [25]: data
Out[25]: 
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
In [26]: data.dropna(axis=1, how='all')
Out[26]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
  1. 只想保留包含一定数量的行,你可以用thresh参数来表示:
In [27]: df = pd.DataFrame(np.random.randn(7, 3))

In [28]: df.iloc[:4, 1] = NA

In [29]: df.iloc[:2, 2] = NA
In [30]: df
Out[30]: 
          0         1         2
0 -0.204708       NaN       NaN
1 -0.555730       NaN       NaN
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
In [31]: df.dropna()
Out[31]: 
          0         1         2
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

In [32]: df.dropna(thresh=2)
Out[32]: 
          0         1         2
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

7.1.2 补全缺失值

fillna

In [33]: df.fillna(0)
Out[33]: 
          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

使用字典,为不同设定不同的填充值:

In [34]: df.fillna({1: 0.5, 2: 0})
Out[34]: 
          0         1         2
0 -0.204708  0.500000  0.000000
1 -0.555730  0.500000  0.000000
2  0.092908  0.500000  0.769023
3  1.246435  0.500000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

fillna返回的是一个新的对象,但你也可以修改已经存在的对象:_代表自己

In [35]: _ = df.fillna(0, inplace=True)

In [36]: df
Out[36]: 
          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

用于重建索引的相同的插值方法,也可以用于fillna:

In [37]: df = pd.DataFrame(np.random.randn(6, 3))

In [38]: df.iloc[2:, 1] = NA

In [39]: df.iloc[4:, 2] = NA
In [40]: df
Out[40]: 
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761       NaN       NaN
5 -1.265934       NaN       NaN

ffill

In [41]: df.fillna(method='ffill')
Out[41]: 
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761  0.124121 -2.370232
5 -1.265934  0.124121 -2.370232

ffilllimit

In [42]: df.fillna(method='ffill', limit=2)
Out[42]: 
0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761       NaN -2.370232
5 -1.265934       NaN -2.370232

使用fillna你可以完成很多带有一点创造性的工作。例如,你可以将Series的平均值或中位数用于填充缺失值:

In [43]: data = pd.Series([1., NA, 3.5, NA, 7])

In [44]: data.fillna(data.mean())
Out[44]: 
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
表7-2是fillna的参考。

7.1 处理缺失值_第1张图片

你可能感兴趣的:(#,7.数据清洗与准备,pandas,python,数据分析)