pandas Study Notes (2): Handling Abnormal Values

Handling missing values: NaN

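# First create a dataset (missing values are added below)
import numpy as np
from pandas import DataFrame

df = DataFrame(data = np.random.randint(0,150,size = (200,4)),
               columns = ['python','english','math','chinese']).astype(float)  # float columns can hold NaN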

# Overwrite up to 30 randomly chosen cells with NaN (positions may repeat)
for i in range(30):
    index = np.random.randint(0,200)
    column = np.random.randint(0,4)
    df.iloc[index,column] = np.nan  # np.NaN was removed in NumPy 2.0
df

Out[14]: 
     python  english   math  chinese
0     104.0     27.0   86.0    113.0
1     138.0      NaN  132.0    126.0
2       3.0    113.0   37.0     64.0
3      47.0    110.0   32.0    100.0
4      87.0     29.0  126.0    144.0
..      ...      ...    ...      ...
195    63.0     98.0  147.0     18.0
196    26.0      9.0   97.0     10.0
197    57.0    141.0    NaN     37.0
198    52.0     78.0   26.0    143.0
199    79.0     46.0  134.0     86.0


# Return the rows that contain missing values
cond = df.isnull().any(axis = 1)
df[cond]

Out[15]: 
     python  english   math  chinese
1     138.0      NaN  132.0    126.0
5     136.0     21.0  146.0      NaN
12     83.0    108.0    NaN     49.0
17      6.0      NaN   87.0     88.0
20    110.0      NaN   18.0    145.0
21     36.0      NaN  130.0     66.0
36     23.0      NaN   74.0     46.0
37      NaN     29.0   18.0     34.0
38    137.0      NaN   67.0     34.0
40     15.0     85.0   82.0      NaN
46     71.0    147.0   69.0      NaN
65    105.0    105.0   12.0      NaN
78      NaN     56.0    3.0     23.0
83     89.0    101.0  133.0      NaN
94     55.0      3.0    NaN    149.0
99     92.0      NaN  121.0     23.0
105    43.0     36.0  110.0      NaN
108    75.0     91.0    NaN     40.0
110    80.0     36.0   40.0      NaN
119    36.0      NaN   59.0     23.0
120    35.0     80.0    NaN    121.0
128    34.0     18.0   24.0      NaN
130   116.0     12.0    NaN     28.0
142    93.0      2.0    NaN     75.0
156    95.0      0.0    NaN     36.0
170     NaN      8.0   38.0    122.0
173     3.0      NaN   28.0     59.0
175     NaN     70.0   92.0     39.0
182   135.0     59.0  102.0      NaN
197    57.0    141.0    NaN     37.0
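
Besides listing the affected rows, it is often useful to count how many values are missing. A minimal sketch, assuming a small hand-made frame named demo (hypothetical, not the random data above):

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row frame with known gaps
demo = pd.DataFrame({
    "python": [90.0, np.nan, 70.0],
    "english": [np.nan, np.nan, 80.0],
})

missing_per_column = demo.isnull().sum()        # NaN count in each column
missing_per_row = demo.isnull().sum(axis=1)     # NaN count in each row
rows_with_missing = int(demo.isnull().any(axis=1).sum())  # rows with at least one NaN

print(missing_per_column)
print(missing_per_row)
print(rows_with_missing)
```

isnull() yields a boolean frame, and summing booleans counts the True entries, so the same mask serves both for filtering and for counting.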

# Return all valid rows (rows with no missing values)
cond = df.notnull().all(axis = 1)
df[cond]

# Method 2:
df.dropna()  # drop every row that contains a missing value
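
dropna() also takes parameters that make the deletion less aggressive. The sketch below uses a hypothetical frame named demo to show how, subset and thresh; it is an illustration, not part of the original notes:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely missing, rows 0 and 3 are partly missing
demo = pd.DataFrame({
    "python": [90.0, np.nan, 70.0, np.nan],
    "english": [np.nan, np.nan, 80.0, 60.0],
    "math": [85.0, np.nan, 75.0, 65.0],
})

any_missing_dropped = demo.dropna()                 # default how="any": keep only complete rows
all_missing_dropped = demo.dropna(how="all")        # drop a row only if every value is NaN
python_required = demo.dropna(subset=["python"])    # only look at the python column
at_least_two = demo.dropna(thresh=2)                # keep rows with at least 2 non-NaN values

print(at_least_two)
```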

# Drop a row or a column
df.drop(labels = ['english'],axis = 1)  # drop one column

# Drop the rows that contain missing values
cond = df.isnull().any(axis = 1)
index = df[cond].index
df.drop(labels = index)

# Drop the rows in which any score is below 60
cond = (df<60).any(axis = 1)
index = df[cond].index
df.drop(labels = index)
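
Dropping by index gives the same result as keeping the rows where the condition is False. A minimal sketch on a hypothetical frame named demo; note that a comparison with NaN evaluates to False, so a NaN cell by itself does not cause a row to be dropped here:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 has a score below 60, row 2 has a NaN
demo = pd.DataFrame({
    "python": [90.0, 50.0, np.nan],
    "english": [80.0, 95.0, 70.0],
})

cond = (demo < 60).any(axis=1)                 # True for rows with any score below 60
dropped = demo.drop(labels=demo[cond].index)   # approach used in the notes
masked = demo[~cond]                           # equivalent: invert the mask and filter

print(masked)
```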

# Find the rows whose average score is below 60 or above 110
cond1 = df.mean(axis = 1)<60
cond2 = df.mean(axis = 1)>110
cond3 = cond1|cond2  # | is element-wise OR, & is element-wise AND
df[cond3]
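
When the conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because | and & bind more tightly than < and >. A small sketch with a hypothetical frame named demo:

```python
import pandas as pd

# Hypothetical frame whose row averages are 45, 90 and 125
demo = pd.DataFrame({
    "python": [50.0, 100.0, 120.0],
    "english": [40.0, 80.0, 130.0],
})

avg = demo.mean(axis=1)
extreme = demo[(avg < 60) | (avg > 110)]     # average below 60 OR above 110
middle = demo[(avg >= 60) & (avg <= 110)]    # average between 60 AND 110

print(extreme)
print(middle)
```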

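# NaN can take part in arithmetic, but in plain Python/NumPy the result is always NaN (np.nan + 1 is nan).
# None cannot take part in arithmetic: None + 1 raises TypeError.
# pandas stores None as NaN in numeric columns, and reductions such as mean() skip NaN by default (skipna=True).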
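
The difference between NaN and None can be checked directly. The following is a minimal sketch; the variable names are chosen only for illustration:

```python
import numpy as np
import pandas as pd

nan_sum = np.nan + 1                 # NaN propagates through arithmetic

try:
    None + 1                         # None does not support arithmetic
    none_error = None
except TypeError as exc:
    none_error = type(exc).__name__

s = pd.Series([1.0, None, 3.0])      # pandas stores the None as NaN (dtype float64)
pandas_mean = s.mean()               # NaN is skipped by default
strict_mean = s.mean(skipna=False)   # NaN propagates when skipping is disabled
numpy_mean = np.array([1.0, np.nan, 3.0]).mean()  # plain NumPy mean does not skip NaN

print(nan_sum, none_error, pandas_mean, strict_mean, numpy_mean)
```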

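# Fill missing values (fillna returns a new DataFrame; df itself is unchanged)
df.fillna(60)
df.fillna(value = df.mean())  # fill each column with that column's mean
# Whatever the method, filled values are made-up data, so make them as plausible as possible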
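
Passing a Series of column means to fillna fills each column with its own mean, matched by column name. A minimal sketch on a hypothetical frame named demo:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: python averages 80, english averages 70
demo = pd.DataFrame({
    "python": [90.0, np.nan, 70.0],
    "english": [np.nan, 60.0, 80.0],
})

filled = demo.fillna(value=demo.mean())   # each NaN gets the mean of its own column

print(filled)
print(int(demo.isnull().sum().sum()))     # the original frame still has its NaN values
```

Assign the result back (for example df = df.fillna(...)) if the filled frame should replace the original.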
df.bfill()  # same as df.fillna(method = 'backfill'), which is deprecated since pandas 2.1
'''
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use next valid observation to fill gap.
'''
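
The two fill directions described in the docstring can be compared on a short Series. This is a small sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()    # carry the last valid value forward
backward = s.bfill()   # pull the next valid value backward

print(forward.tolist())
print(backward.tolist())
```

The trailing NaN survives the backward fill because no later valid value exists; a leading NaN would survive a forward fill in the same way.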
# Missing values can also be filled with an algorithm or with a local average
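
One possible reading of a local-average fill is to replace each NaN with the mean of its neighbours. The sketch below assumes a centred window of 3 values and ordered data; the window size and the use of rolling() and interpolate() are illustrative choices, not something the notes prescribe:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, np.nan, 40.0, 50.0])

# Mean of each value's 3-wide neighbourhood, ignoring NaN inside the window
local_mean = s.rolling(window=3, center=True, min_periods=1).mean()
filled = s.fillna(local_mean)        # only the NaN positions take the local mean

# Linear interpolation between the nearest valid neighbours is another option
interpolated = s.interpolate()

print(filled.tolist())
print(interpolated.tolist())
```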
