@Data Preprocessing
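Data collected in practice often suffers from quality problems: inconsistency (e.g. different data sources or different units of measurement), noise (e.g. acquisition devices with poor interference resistance, manual entry errors), and missing or incomplete values (e.g. partially completed questionnaires, acquisition device failures).
Data quality directly affects modeling results, so the data must be preprocessed appropriately before a model is built.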
sklearn.impute.SimpleImputer
Usage of the missing-value imputation class in Scikit-learn:
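SimpleImputer(
missing_values=np.nan, # placeholder that marks a missing value
strategy='mean', # imputation strategy
fill_value=None, # constant used when strategy="constant"
verbose=0, # controls the verbosity of the imputer (removed in recent scikit-learn releases)
copy=True # True: a copy of X is created; False: imputation is done in place on X where possible (there are exceptions)
)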
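strategy:
"mean": replace missing values with the mean; generally used for continuous data. Mean imputation reduces the variance of the feature.
"median": replace missing values with the median.
"most_frequent": replace missing values with the mode; generally used for discrete data.
"constant": fill with a constant (the value given by fill_value).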
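The note that mean imputation shrinks the variance can be checked on a toy feature. This is a minimal sketch, assuming a made-up six-value column; the array x and the variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy feature with two missing entries
x = np.array([[1.0], [2.0], [np.nan], [4.0], [np.nan], [9.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x_filled = imputer.fit_transform(x)

# Variance of the observed values only (NaN entries are ignored)
var_observed = np.nanvar(x)
# Variance after every NaN has been replaced by the column mean
var_filled = x_filled.var()

print(imputer.statistics_)  # the per-column fill value learned by fit
print(var_observed, var_filled)
```

The filled entries sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the sample count, which is why the variance goes down.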
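eg:
# Imports
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
# Read the data
data = pd.read_csv("file_path")
# Inspect missing values
data.info()
# Instantiate the SimpleImputer objects imp_mean and imp_most
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")  # mean imputation
# Create a new column to receive the result
data["new_column1"] = imp_mean.fit_transform(data[["name_of_numeric_feature_to_fill"]]).ravel()  # array after mean imputation
imp_most = SimpleImputer(missing_values=np.nan, strategy="most_frequent")  # mode imputation
# Create a new column to receive the result
data["new_column2"] = imp_most.fit_transform(data[["name_of_non_numeric_feature_to_fill"]]).ravel()  # array after mode imputation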
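For readers without a CSV at hand, the same two imputers can be tried on an in-memory table. This is a sketch under assumptions: the column names age and city and their values are invented for illustration and do not come from any real data set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one numeric and one categorical column, each with a gap
data = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 60.0],
    "city": pd.Series(["Beijing", np.nan, "Beijing", "Shanghai"], dtype=object),
})

# Mean imputation for the numeric column
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
data["age_filled"] = imp_mean.fit_transform(data[["age"]]).ravel()

# Mode imputation for the categorical column
imp_most = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
data["city_filled"] = imp_most.fit_transform(data[["city"]]).ravel()

print(data)
```

fit_transform expects 2-D input, hence the double brackets; it returns a 2-D array, and ravel() flattens it so it can be stored as a single column.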
Method of pandas objects:
fillna(
value=None,
method=None,
axis=None,
inplace=False,
limit=None,
downcast=None,
)
eg:
data.fillna(method="ffill")  # recent pandas versions deprecate the method argument; data.ffill() is the equivalent call
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). Values not
in the dict/Series/DataFrame will not be filled. This value cannot
be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use next valid observation to fill gap.
axis : {0 or 'index', 1 or 'columns'}
Axis along which to fill missing values.
inplace : bool, default False
If True, fill in-place. Note: this will modify any
other views on this object (e.g., a no-copy slice for a column in a
DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
downcast : dict, default is None
A dict of item->dtype of what to downcast if possible,
or the string 'infer' which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible).
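The value and limit parameters described above can be illustrated with a small table. This is a minimal sketch on invented data; it uses DataFrame.ffill for the forward fill, which is assumed here as the equivalent of method="ffill" on pandas versions where that argument is deprecated.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps in both columns
data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [np.nan, 2.0, np.nan, np.nan],
})

# Scalar value: every NaN becomes 0
filled_zero = data.fillna(0)

# dict value: one fill value per column; columns not in the dict are left untouched
filled_dict = data.fillna({"a": -1.0})

# Forward fill, propagating the last valid value over at most one consecutive NaN
filled_ffill = data.ffill(limit=1)

print(filled_zero, filled_dict, filled_ffill, sep="\n\n")
```

With limit=1 the two-NaN gap in column a is only partially filled, and the leading NaN in column b stays missing because there is no earlier valid value to propagate.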
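The deletion method obtains a complete subset of the data by removing the entries that contain missing values.
Deleting features: when a feature has many missing values and has little influence on the goal of the analysis, the feature can be dropped.
Deleting samples: drop the samples that contain missing values. This suits cases where the affected samples are missing several features each and make up only a small share of the whole data set.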
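Both deletion strategies can be sketched in a few lines of pandas. The 0.5 cutoff for "many missing values" and the column names f1, f2, f3 are assumptions made for this illustration; the right threshold depends on the data and the analysis goal.

```python
import numpy as np
import pandas as pd

# Hypothetical table: f2 is mostly missing, f3 has a single gap
data = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "f3": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Fraction of missing values in each feature
missing_ratio = data.isna().mean()

# Delete features: keep only columns whose missing ratio is at most the cutoff
threshold = 0.5  # assumed cutoff, not a recommended value
reduced = data.loc[:, missing_ratio <= threshold]

# Delete samples: drop the remaining rows that still contain a missing value
complete = reduced.dropna(axis=0)

print(missing_ratio)
print(complete)
```

Dropping the sparse feature first keeps more samples: deleting rows directly from the original table would leave only one complete row here.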
Method of pandas objects:
dropna(
axis=0,
how='any',
thresh=None,
subset=None,
inplace=False
)
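eg:
data.dropna(thresh=min_valid)  # min_valid (placeholder): int, the minimum number of non-NA values a row needs to be kept; recent pandas versions do not allow how and thresh to be set together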
how : {'any', 'all'}, default 'any'
Determine if row or column is removed from DataFrame, when we have
at least one NA or all NA.
* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.
thresh : int, optional
Require that many non-NA values.
subset : array-like, optional
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include.
inplace : bool, default False
If True, do operation inplace and return None.
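The effect of how, thresh, and subset is easiest to see side by side. This is a small sketch on an invented three-column table; the column names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 0 is complete, row 2 is entirely missing
data = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, np.nan, 4.0],
})

# how="any": drop a row if it contains at least one NA
any_dropped = data.dropna(how="any")

# how="all": drop a row only if every value in it is NA
all_dropped = data.dropna(how="all")

# thresh=3: keep only rows with at least 3 non-NA values
thresh_kept = data.dropna(thresh=3)

# subset=["a"]: only look at column "a" when deciding which rows to drop
subset_kept = data.dropna(subset=["a"])

print(list(any_dropped.index), list(all_dropped.index))
print(list(thresh_kept.index), list(subset_kept.index))
```

None of these calls modifies data, because inplace defaults to False; each returns a new DataFrame.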
Example link:
https://blog.csdn.net/it_liujh/article/details/123244631