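It is often said that data cleaning takes up about 80% of the time in data analysis.
1. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values. It acts as a sentinel value that can be easily detected.
In pandas, missing values are referred to as NA (not available).
The built-in Python None is also treated as NA.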
import numpy as np
import pandas as pd
obj=pd.Series(['li','xun',np.nan,'big'])
print(obj)
print(obj.isnull())
obj[0]=None
print(obj.isnull())
# Output:
0 li
1 xun
2 NaN
3 big
dtype: object
0 False
1 False
2 True
3 False
dtype: bool
0 True
1 False
2 True
3 False
dtype: bool
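As a small complement to the example above, here is a minimal sketch of how None behaves in a numeric Series. The variable names nums and words are made up for illustration. isna is used here as an alias of isnull.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
nums = pd.Series([1, None, 3])
print(nums.dtype)
print(nums.isna().tolist())

# In a Series of strings, both None and np.nan are detected as missing.
words = pd.Series(['li', None, np.nan])
print(words.isna().tolist())
```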
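2. Filtering out missing values
There are several ways to filter out missing values. You can build a boolean mask with isnull (or notnull) and use boolean indexing, or you can call dropna. On a Series, dropna returns a new Series containing only the non-null data and their index values.
from numpy import nan as NA  # alias np.nan as NA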
obj1=pd.Series([1,NA,3,NA,5])
print(obj1)
print(obj1.dropna())
# Output:
0 1.0
1 NaN
2 3.0
3 NaN
4 5.0
dtype: float64
0 1.0
2 3.0
4 5.0
dtype: float64
The above is equivalent to obj1[obj1.notnull()]
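The equivalence can be checked directly. This is only a sketch that rebuilds the same Series; by_dropna and by_mask are illustrative names.

```python
import numpy as np
import pandas as pd

obj1 = pd.Series([1, np.nan, 3, np.nan, 5])

by_dropna = obj1.dropna()          # drop missing entries
by_mask = obj1[obj1.notnull()]     # boolean indexing with a not-null mask

# Both approaches keep the same values and the same index labels.
print(by_dropna.equals(by_mask))
# Count how many values were missing in the first place.
print(int(obj1.isnull().sum()))
```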
3. With a DataFrame, you may want to drop rows or columns that contain NA. By default, dropna drops every row that contains at least one missing value.
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
[NA, NA, NA], [NA, 6.5, 3.]])
print(data)
cleaned=data.dropna()
print(cleaned)
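# Passing how='all' to dropna() drops only the rows in which every value is NA
print(data.dropna(how='all'))
data[4]=NA  # add a column labeled 4 whose values are all NA
print(data)
print(data.dropna(axis='columns',how='all'))  # drop the columns in which every value is NA
data.loc[4]=NA  # add a row labeled 4 whose values are all NA
print(data)
# Output: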
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
0 1 2
0 1.0 6.5 3.0
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
3 NaN 6.5 3.0
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
0 1 2
0 1.0 6.5 3.0
1 1.0 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3.0
0 1 2 4
0 1.0 6.5 3.0 NaN
1 1.0 NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN 6.5 3.0 NaN
4 NaN NaN NaN NaN
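Before dropping anything, it can help to see which rows each mode would remove. The sketch below rebuilds the original 4x3 frame and uses isnull together with any and all; the names na_per_row, rows_with_any_na and rows_all_na are made up for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Number of missing values in each row.
na_per_row = data.isnull().sum(axis=1)
print(na_per_row.tolist())

# Rows removed by the default dropna() (any NA in the row).
rows_with_any_na = data.index[data.isnull().any(axis=1)].tolist()
# Rows removed by dropna(how='all') (every value in the row is NA).
rows_all_na = data.index[data.isnull().all(axis=1)].tolist()
print(rows_with_any_na)
print(rows_all_na)
```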
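4. To keep only the rows (or columns) that contain at least a certain number of non-NA values, pass the thresh argument to dropna().
dropna(thresh=2) keeps the rows that have at least two non-NA values and drops the rest.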
df=pd.DataFrame(np.random.randn(7,3))
print(df)
df.iloc[:4,1]=NA  # set column 1 of rows 0 through 3 to NA; iloc slices by position and excludes the end point, whereas label-based loc slices include it
print(df)
df.iloc[:2,2]=NA
print(df)
print(df.dropna())  # by default, dropna drops every row that contains any NA
# Output:
0 1 2
0 0.909207 -0.232940 0.641159
1 -1.391261 0.303475 -0.520221
2 0.609855 1.111600 0.130441
3 -1.063254 0.133358 -1.180378
4 -0.114915 -1.690312 -1.499811
5 0.577400 0.176858 -0.383826
6 0.075906 2.098203 0.897373
0 1 2
0 0.909207 NaN 0.641159
1 -1.391261 NaN -0.520221
2 0.609855 NaN 0.130441
3 -1.063254 NaN -1.180378
4 -0.114915 -1.690312 -1.499811
5 0.577400 0.176858 -0.383826
6 0.075906 2.098203 0.897373
0 1 2
0 0.909207 NaN NaN
1 -1.391261 NaN NaN
2 0.609855 NaN 0.130441
3 -1.063254 NaN -1.180378
4 -0.114915 -1.690312 -1.499811
5 0.577400 0.176858 -0.383826
6 0.075906 2.098203 0.897373
0 1 2
4 -0.114915 -1.690312 -1.499811
5 0.577400 0.176858 -0.383826
6 0.075906 2.098203 0.897373
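The code above only shows the default dropna(). A minimal sketch of thresh itself follows. It uses fixed values instead of random ones so the result is reproducible; that substitution is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Same shape and NA pattern as above, but with deterministic values.
df = pd.DataFrame(np.arange(21, dtype=float).reshape(7, 3))
df.iloc[:4, 1] = np.nan   # rows 0-3, column 1
df.iloc[:2, 2] = np.nan   # rows 0-1, column 2

# Keep only the rows that have at least 2 non-NA values.
# Rows 0 and 1 have a single non-NA value, so they are dropped.
kept = df.dropna(thresh=2)
print(kept)
```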
5. Filling in missing values
print(df)
print(df.fillna(0))  # replace every missing value in df with 0
# Output:
0 1 2
0 -1.239803 NaN NaN
1 0.895746 NaN NaN
2 0.744441 NaN -0.409287
3 0.774851 NaN 1.236767
4 0.870785 0.197317 1.020200
5 1.418123 -0.536136 -0.416445
6 -0.162113 0.604414 0.761549
0 1 2
0 -1.239803 0.000000 0.000000
1 0.895746 0.000000 0.000000
2 0.744441 0.000000 -0.409287
3 0.774851 0.000000 1.236767
4 0.870785 0.197317 1.020200
5 1.418123 -0.536136 -0.416445
6 -0.162113 0.604414 0.761549
# Passing a dict to fillna() fills each column with a different value (keys are column labels)
print(df.fillna({1:0.5,2:0}))
# Output:
0 1 2
0 -0.421190 0.500000 0.000000
1 0.872530 0.500000 0.000000
2 0.037437 0.500000 -0.531982
3 2.689439 0.500000 -0.844821
4 0.875314 -0.474378 -0.066359
5 -0.299708 1.298735 1.332805
6 2.090344 0.012999 -0.356001
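The same per-column idea works with named columns, and the fill values can also come from a Series indexed by column label, such as the column means. The frame below and the column names a and b are invented for this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1., np.nan, 3.], 'b': [np.nan, 4., 8.]})

# A dict maps each column label to its own fill value.
by_dict = df.fillna({'a': 0.5, 'b': 0})
print(by_dict)

# df.mean() is a Series indexed by column label, so each column
# is filled with its own mean (2.0 for 'a', 6.0 for 'b').
by_mean = df.fillna(df.mean())
print(by_mean)
```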
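# fillna() returns a new object by default. To modify the existing data directly instead, pass inplace=True.
# With inplace=True the filled result is written into df itself, so the return value is not assigned to a new variable.
df.fillna(0,inplace=True)
print(df)
# Output: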
0 1 2
0 -0.569080 0.000000 0.000000
1 -0.384909 0.000000 0.000000
2 0.117809 0.000000 -2.176390
3 -0.330230 0.000000 -0.758827
4 0.307959 -1.003411 -1.241733
5 1.454243 -0.814340 1.009244
6 0.276313 -1.514511 -0.193618
# Assigning through iloc modifies df2 in place, and the change persists, as the following two assignments show
df2=pd.DataFrame(np.random.randn(6,3))
#print(df2)
df2.iloc[2:,1]=NA
print(df2)
df2.iloc[4:,2]=NA
print(df2)
# Output:
0 1 2
0 -0.055935 -0.636866 0.062053
1 -0.179243 1.571229 -0.420884
2 -0.125707 NaN -1.021118
3 -2.442601 NaN 0.293664
4 -0.060766 NaN 0.101210
5 -1.287985 NaN -0.135381
0 1 2
0 -0.055935 -0.636866 0.062053
1 -0.179243 1.571229 -0.420884
2 -0.125707 NaN -1.021118
3 -2.442601 NaN 0.293664
4 -0.060766 NaN NaN
5 -1.287985 NaN NaN
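# Forward fill: each missing value is replaced by the last valid value above it in the same column.
# df2.ffill() is equivalent to fillna(method='ffill'); the method argument of fillna is deprecated in recent pandas versions.
print(df2.ffill())
# limit caps how many consecutive missing values are filled
print(df2.ffill(limit=3))
print(df2.ffill(limit=2))
# Output: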
0 1 2
0 -0.066499 0.230082 0.451753
1 2.258538 -1.015993 -0.305535
2 0.986370 -1.015993 -0.733767
3 -0.620577 -1.015993 0.053684
4 0.525918 -1.015993 0.053684
5 -0.359719 -1.015993 0.053684
0 1 2
0 -0.066499 0.230082 0.451753
1 2.258538 -1.015993 -0.305535
2 0.986370 -1.015993 -0.733767
3 -0.620577 -1.015993 0.053684
4 0.525918 -1.015993 0.053684
5 -0.359719 NaN 0.053684
0 1 2
0 -0.066499 0.230082 0.451753
1 2.258538 -1.015993 -0.305535
2 0.986370 -1.015993 -0.733767
3 -0.620577 -1.015993 0.053684
4 0.525918 NaN 0.053684
5 -0.359719 NaN 0.053684
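To make the effect of limit easier to see, here is a small sketch on a fixed Series rather than random data. It also shows bfill, the backward counterpart, which the text above does not cover. The Series s is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1., np.nan, np.nan, np.nan, 5.])

# Forward fill: carry the last valid value (1.0) down into the gap.
forward = s.ffill()
# With limit=2, only the first two consecutive NaNs are filled.
forward_limited = s.ffill(limit=2)
# Backward fill: pull the next valid value (5.0) up into the gap.
backward = s.bfill()

print(forward.tolist())
print(forward_limited.tolist())
print(backward.tolist())
```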
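fillna also lets you do some more creative things. For example, you can pass in the mean or the median of a Series.
data=pd.Series([1.,NA,3.5,NA,7])
# Here the mean of the Series is used to fill the missing values
print(data.fillna(data.mean()))
# Output: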
0 1.000000
1 3.833333
2 3.500000
3 3.833333
4 7.000000
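The median variant mentioned above looks almost the same. A minimal sketch on the same Series, with filled as an illustrative name:

```python
import numpy as np
import pandas as pd

data = pd.Series([1., np.nan, 3.5, np.nan, 7])

# median() skips missing values, so it is computed from 1.0, 3.5 and 7.0.
print(data.median())

# Fill the missing values with the median instead of the mean.
filled = data.fillna(data.median())
print(filled)
```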
I just realized I haven't updated this blog for a week. Haha, it looks like I've once again broken the goal I set for myself.