Data Processing: Filling Missing Values with pandas (Mean, Mode, Median, and More)


1. Overview

Let's start with a brief summary of the causes and types of missing data and the main ways of handling them, as shown in the figure below:

2. Direct Deletion

When missing values account for only a very small fraction of the data, the rows containing them can simply be dropped. If the proportion of missing values is large, however, deleting them outright will throw away important information.

Before deleting anything, you need to count how many missing values the dataset contains. The Python code for counting missing values is shown below (working directly on a concrete dataset):

import numpy as np
import pandas as pd

data = pd.read_csv('1.csv')  # if you need the data (the public algae dataset), leave a comment with your email!
data.head()

null_all = data.isnull().sum()  # count missing values per column (method 1)
null_all

data.info()  # count missing values (method 2)

# new_data = data.dropna()  # 1 -- drop every row that contains a missing value
# new_data = data.dropna(subset=['C1','Chla'])  # 2 -- drop rows with missing values in the specified columns
new_data = data.dropna(thresh=15)  # 3 -- drop rows with fewer than 15 non-missing values (i.e., rows missing many fields)
new_data.info()

3. Forward Fill / Backward Fill

import numpy as np
import pandas as pd

data = pd.read_csv('1.csv')
data[50:60]  # show some rows with missing values

data = data.fillna(method='ffill')  # 'ffill' = forward fill; 'bfill' = backward fill
data[50:60]
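Two caveats worth knowing (these follow from pandas' documented fillna behavior rather than from anything specific to this dataset): a leading NaN has nothing in front of it and stays missing under forward fill, and a long run of NaNs is filled entirely from a single observation unless you cap the propagation with the limit parameter. A minimal sketch on a made-up Series:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])
s.fillna(method='ffill')           # leading NaN stays; 1.0 is copied into all three middle gaps
s.fillna(method='ffill', limit=1)  # each value is propagated at most one step forward
s.fillna(method='bfill')           # backward fill: take the next valid value instead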

4. Mean, Mode, and Median Imputation

Missing values can also be filled based on the similarity between samples, using a value that represents the central tendency of the variable. Typical measures of central tendency are the mean, the median, and the mode. So which of these should we use to fill in the missing values?

(4.1) Method 1 (.fillna())

import numpy as np
import pandas as pd

data = pd.read_csv('1.csv')
data['C1'] = data['C1'].fillna(data['C1'].mean())  # mean imputation; swap .mean() for .median() or .mode()[0] as needed
data[50:60]

Note: when filling with the mode, pay special attention to the cases where the mode does not exist or where there is more than one mode!
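Since pandas' .mode() returns a Series (empty when the column has no valid values, several entries when there are ties) rather than a scalar, a small guard is needed. A minimal sketch on the C1 column of this dataset; the fallback message is just illustrative:

import pandas as pd

data = pd.read_csv('1.csv')

modes = data['C1'].mode()  # Series of modes: empty if C1 is all NaN, multiple rows if there are ties
if not modes.empty:
    data['C1'] = data['C1'].fillna(modes[0])  # by convention, take the first (smallest) mode
else:
    print('C1 has no mode (no valid values); fall back to another strategy')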

(4.2) Method 2 (SimpleImputer)

SimpleImputer provides basic strategies for handling missing values, such as replacing them with the mean, median, or most frequent value of the column in which they occur.

import numpy as np
import pandas as pd

data = pd.read_csv('1.csv')

# from sklearn.preprocessing import Imputer  # scikit-learn (older versions)
from sklearn.impute import SimpleImputer  # scikit-learn >= 0.20 (0.22.2 at the time of writing)

imputer = SimpleImputer(strategy='mean')
imputer = imputer.fit(data.iloc[:, 3:].values)
imputer_data = pd.DataFrame(imputer.transform(data.iloc[:, 3:].values), columns=data.columns[3:])
imputer_data[53:64]
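For the other measures of central tendency, only the strategy argument changes: 'median' and 'most_frequent' are scikit-learn's documented options for median and mode imputation. A short sketch on the same columns:

import pandas as pd
from sklearn.impute import SimpleImputer

data = pd.read_csv('1.csv')

# median imputation
median_data = pd.DataFrame(SimpleImputer(strategy='median').fit_transform(data.iloc[:, 3:].values),
                           columns=data.columns[3:])

# mode imputation ('most_frequent' also works on non-numeric columns)
mode_data = pd.DataFrame(SimpleImputer(strategy='most_frequent').fit_transform(data.iloc[:, 3:].values),
                         columns=data.columns[3:])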

5. Interpolation

interpolate() performs linear interpolation by default: each missing value is estimated from the valid values on either side of it. For a single gap this is simply the average of the two neighboring values; for a longer gap the filled values are spaced evenly between the neighbors.

import numpy as np
import pandas as pd

data = pd.read_csv('1.csv')
data['C1'] = data['C1'].interpolate()
data[53:63]
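To make that behavior concrete, here is a tiny illustration on a hand-made Series (the numbers are purely for demonstration):

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])
s.interpolate()  # -> 1.0, 2.0, 3.0, 5.0, 7.0, 9.0: a single gap becomes the average of its neighbors,
                 #    while a longer gap is filled with evenly spaced values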

6. KNN Imputation (Mean of Nearest Neighbors)

To perform KNN imputation, we first use other methods to fill in the features that have only a few missing values (KNN relies on the other, non-missing features to find each row's nearest neighbors and then fills the missing value with a weighted average over those neighbors), which gives the feature data below:

(6.1) from fancyimpute import KNN

First you need to install the third-party package fancyimpute. Installing it is quite a hassle (especially on Windows, where the pitfalls run deep)!

[1] Downloading the packages (7 packages + 1 installer)

Link: https://pan.baidu.com/s/1CUfiaEyE-k4G560L2JsOYQ

Extraction code: nriv

Note: these packages must match your Python version (Python 3.6 here); for other Python versions you will need to download the matching builds.

[2] Installing the packages (on Windows)

pip install D:\fancyimpute\package1
pip install D:\fancyimpute\package2
pip install D:\fancyimpute\package3
pip install D:\fancyimpute\package4
pip install D:\fancyimpute\package5
pip install D:\fancyimpute\package6
pip install D:\fancyimpute\fancyimpute-0.5.4.tar.gz

[3] You may then see errors like the following

ERROR: tensorflow 2.1.0 has requirement scipy==1.4.1; python_version >= "3", but you'll have scipy 1.1.0 which is incompatible.

ERROR: tensorflow 2.1.0 has requirement six>=1.12.0, but you'll have six 1.11.0 which is incompatible.

ERROR: Cannot uninstall 'wrapt'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.

[4] Don't panic; keep installing

pip install --upgrade scipy==1.4.1
pip install --upgrade six==1.12.0
pip install wrapt --ignore-installed

[5] Wow, there may be yet another error

ImportError: Could not find the DLL(s) 'msvcp140_1.dll'

[6] No way around it; one more install

Install vc_redist.x64.exe from the network-drive download above and this error goes away.

Phew, finally back to the code!!!

import pandas as pd
from fancyimpute import KNN  # requires fancyimpute to be installed first

data = pd.read_csv('1.csv')

# interpolate the features that have only a few missing values
data['mxPH'] = data['mxPH'].interpolate()
data['MNO2'] = data['MNO2'].interpolate()
data['NO3'] = data['NO3'].interpolate()
data['NH4'] = data['NH4'].interpolate()
data['Opo4'] = data['Opo4'].interpolate()
data['PO4'] = data['PO4'].interpolate()
data.info()

new_data = data.iloc[:, 3:11]
new_data[53:64]

fill_knn = KNN(k=3).fit_transform(new_data)  # fill each missing value from the 3 nearest rows
new_data = pd.DataFrame(fill_knn, columns=data.columns[3:11])
new_data[53:64]
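As an aside: if installing fancyimpute proves too painful, scikit-learn 0.22 and later ship sklearn.impute.KNNImputer, which performs the same k-nearest-neighbors averaging without the extra dependencies. A minimal sketch on the same slice of columns (the n_neighbors and weights choices are just illustrative):

import pandas as pd
from sklearn.impute import KNNImputer

data = pd.read_csv('1.csv')
new_data = data.iloc[:, 3:11]

knn_imputer = KNNImputer(n_neighbors=3, weights='distance')  # distance-weighted average of the 3 nearest rows
filled = pd.DataFrame(knn_imputer.fit_transform(new_data), columns=new_data.columns)
filled[53:64]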

(6.2) from sklearn.neighbors import KNeighborsRegressor

import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

data = pd.read_csv('1.csv')

# interpolate the features that have only a few missing values
data['mxPH'] = data['mxPH'].interpolate()
data['MNO2'] = data['MNO2'].interpolate()
data['NO3'] = data['NO3'].interpolate()
data['NH4'] = data['NH4'].interpolate()
data['Opo4'] = data['Opo4'].interpolate()
data['PO4'] = data['PO4'].interpolate()

C1_data = data[['mxPH', 'MNO2', 'NO3', 'NH4', 'Opo4', 'PO4', 'C1']]
C1_data[53:64]

known_C1 = C1_data[C1_data.C1.notnull()]   # rows where C1 is observed -> training set
unknown_C1 = C1_data[C1_data.C1.isnull()]  # rows where C1 is missing -> to be predicted

X_train = np.array(known_C1.iloc[:, :6])
y_train = np.array(known_C1.iloc[:, 6])
X_test = np.array(unknown_C1.iloc[:, :6])

clf = KNeighborsRegressor(n_neighbors=6, weights="distance").fit(X_train, y_train)
y_test = clf.predict(X_test)
y_test
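One step the block above leaves out is writing the predictions back into the DataFrame; a short follow-up, assuming the data, clf, and unknown_C1 objects defined above:

# put the KNN predictions into the rows where C1 was missing
data.loc[data.C1.isnull(), 'C1'] = clf.predict(np.array(unknown_C1.iloc[:, :6]))
data.C1.isnull().sum()  # should now be 0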

7. Random Forest Imputation

We've rambled on long enough above, so let's go straight to the code!

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

data = pd.read_csv('1.csv')

# fill the predictor columns first, here simply with their means
data.mxPH = data.mxPH.fillna(data.mxPH.mean())
data.MNO2 = data.MNO2.fillna(data.MNO2.mean())

C1_data = data[['mxPH', 'MNO2', 'C1']]
C1_data[53:64]

known_C1 = C1_data[C1_data.C1.notnull()]   # rows where C1 is observed
unknown_C1 = C1_data[C1_data.C1.isnull()]  # rows where C1 is missing

X = np.array(known_C1.iloc[:, :2])
y = np.array(known_C1.iloc[:, 2])

rfr = RandomForestRegressor(random_state=0, n_estimators=200, n_jobs=-1)
rfr.fit(X, y)

data.loc[(data.C1.isnull()), 'C1'] = rfr.predict(unknown_C1.iloc[:, :2])
data[53:64]
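As a quick sanity check (my own addition rather than part of the original workflow), you can confirm that C1 is now complete and look at its distribution after imputation:

data.C1.isnull().sum()  # 0 -> every missing C1 value has been filled by the random forest
data.C1.describe()      # inspect the overall distribution after imputation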

8. Summary

That's all for now; I'm a bit tired, time to take a break!! (More methods will be added later.)

In the same spirit, the next post will cover some feature-engineering techniques for data, so stay tuned!!


Continuously updated; to be continued…
