目录
处理缺失数据
滤除缺失数据
填补缺失数据
pandas的设计目标之一就是让缺失数据的处理任务更轻松,pandas使用浮点值NaN表示浮点数组和非浮点数组中的缺失数据,是一个便于被检测的标记
python内置的None也会被当作NA处理
from pandas import Series
>>> string_data = Series(['aardvark','artichoke',np.nan,'avocado'])
>>> string_data
0 aardvark
1 artichoke
2 NaN
3 avocado
dtype: object
>>> string_data.isnull()
0 False
1 False
2 True
3 False
dtype: bool
>>> string_data[0] = None
>>> string_data.isnull()
0 True
1 False
2 True
3 False
dropna返回一个仅含非空数据和索引值的Series.
>>> from numpy import nan as NA
>>> data = Series([1,NA,3.5,NA,7])
>>> data.dropna()
0 1.0
2 3.5
4 7.0
dtype: float64
>>> data = data[data.notnull()]
>>> data
0 1.0
2 3.5
4 7.0
dtype: float64
对于DataFrame对象,dropna默认丢弃含有缺失值的行
传入how=‘all’即可只丢弃值全为NA的那些行
再次传入axis= 1即可丢弃列
>>> data = DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])
>>> cleaned = data.dropna()
>>> data
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
>>> cleaned
0 1 2
0 1.0 6.5 3.0
>>> data.dropna(how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
>>> data[4] = NA
>>> data.dropna(axis = 1,how='all')
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
对于过滤时间序列数据,假如只需要与部分观测数据,可以用thresh参数实现目的
>>> df = DataFrame(np.random.randn(7,3))
>>> df.ix[:4,1] = NA;df.ix[:2,2] = NA
>>> df
0 1 2
0 -0.470646 NaN NaN
1 0.650343 NaN NaN
2 0.616738 NaN NaN
3 -0.244354 NaN 0.863931
4 -0.063218 NaN 0.978948
5 0.145755 0.630745 1.120421
6 -0.111001 0.322309 -0.860522
>>> df.dropna(thresh = 3)
0 1 2
5 0.145755 0.630745 1.120421
6 -0.111001 0.322309 -0.860522
>>> df.dropna(thresh = 2)
0 1 2
3 -0.244354 NaN 0.863931
4 -0.063218 NaN 0.978948
5 0.145755 0.630745 1.120421
6 -0.111001 0.322309 -0.860522
>>> df.dropna(thresh = 1)
0 1 2
0 -0.470646 NaN NaN
1 0.650343 NaN NaN
2 0.616738 NaN NaN
3 -0.244354 NaN 0.863931
4 -0.063218 NaN 0.978948
5 0.145755 0.630745 1.120421
6 -0.111001 0.322309 -0.860522
>>> df.dropna(thresh = 0)
0 1 2
0 -0.470646 NaN NaN
1 0.650343 NaN NaN
2 0.616738 NaN NaN
3 -0.244354 NaN 0.863931
4 -0.063218 NaN 0.978948
5 0.145755 0.630745 1.120421
6 -0.111001 0.322309 -0.860522
通常·使用fillna函数将缺失值替换为常数值
>>> df.fillna(0)
0 1 2
0 -0.470646 0.000000 0.000000
1 0.650343 0.000000 0.000000
2 0.616738 0.000000 0.000000
3 -0.244354 0.000000 0.863931
4 -0.063218 0.000000 0.978948
5 0.145755 0.630745 1.120421
6 -0.111001 0.322309 -0.860522
>>> df.fillna({1:0.5,2:-1})
0 1 2
0 -0.470646 0.500000 -1.000000
1 0.650343 0.500000 -1.000000
2 0.616738 0.500000 -1.000000
3 -0.244354 0.500000 0.863931
4 -0.063218 0.500000 0.978948
5 0.145755 0.630745 1.120421
6 -0.111001 0.322309 -0.860522
fillna会默认返回新对象,但也可以通过设置修改现有对象
>>> _ = df.fillna(0,inplace=True)
>>> df
0 1 2
0 -0.470646 0.000000 0.000000
1 0.650343 0.000000 0.000000
2 0.616738 0.000000 0.000000
3 -0.244354 0.000000 0.863931
4 -0.063218 0.000000 0.978948
5 0.145755 0.630745 1.120421
6 -0.111001 0.322309 -0.860522
对reindex有效的那些方法也可以用于fillna:
>>> df = DataFrame(np.random.randn(6,3))
>>> df.ix[2:,1] = NA;df.ix[4:,2] = NA
>>> df
0 1 2
0 -0.829840 -2.290885 0.258595
1 0.774171 -2.287114 0.873521
2 0.601178 NaN -0.660464
3 0.775981 NaN -1.640218
4 -1.792551 NaN NaN
5 0.411994 NaN NaN
>>> df.fillna(method = 'ffill')
0 1 2
0 -0.829840 -2.290885 0.258595
1 0.774171 -2.287114 0.873521
2 0.601178 -2.287114 -0.660464
3 0.775981 -2.287114 -1.640218
4 -1.792551 -2.287114 -1.640218
5 0.411994 -2.287114 -1.640218
>>> df.fillna(method = 'ffill',limit = 2)
0 1 2
0 -0.829840 -2.290885 0.258595
1 0.774171 -2.287114 0.873521
2 0.601178 -2.287114 -0.660464
3 0.775981 -2.287114 -1.640218
4 -1.792551 NaN -1.640218
5 0.411994 NaN -1.640218
>>> df.fillna(method = 'ffill',limit = 1)
0 1 2
0 -0.829840 -2.290885 0.258595
1 0.774171 -2.287114 0.873521
2 0.601178 -2.287114 -0.660464
3 0.775981 NaN -1.640218
4 -1.792551 NaN -1.640218
5 0.411994 NaN NaN
用每一列平均数填补
>>> df.fillna(df.mean())
0 1 2
0 -0.829840 -2.290885 0.258595
1 0.774171 -2.287114 0.873521
2 0.601178 -2.288999 -0.660464
3 0.775981 -2.288999 -1.640218
4 -1.792551 -2.288999 -0.292142
5 0.411994 -2.288999 -0.292142