数据预处理一般包括:
(1) 数据标准化
这是最常用的数据预处理,把某个特征的所有样本转换成均值为0,方差为1。
将数据转换成标准正态分布的方法:
对每维特征单独处理:
其中,
可以调用sklearn.preprocessing中的StandardScaler()进行数据的标准化。
(2) 数据归一化
把某个特征的所有样本取值限定在规定范围内(一般为[-1,1]或者[0,1])。
归一化得方法为:
可以调用sklearn.preprocessing中的MinMaxScaler()将数据限定在[0,1]范围,调用MaxAbsScaler()将数据限定在[-1,1]范围。
(3) 数据正规化
把某个特征的所有样本的模长转换为1。方法为:
可以调用sklearn.preprocessing中的Normalizer()实现
(4) 数据二值化
把数据的特征取值根据阈值转为为0或者1。
(5) 数据缺值处理
对于缺失的特征数据,进行数据填补,一般填补的方法有:均值,中位数,众数填补等。
(6) 数据离群点处理
删除离群点数据。
(7) 数据类型转换
如果数据的特征不是数值型特征,则需要转换为数值型。
数据处理工具包为:Numpy,SciPy,pandas,其中SciPy,pandas是基于Numpy进一步的封装
数据可视化工具包为:Matplotlib,Seaborn,其中Seaborn是基于Matplotlib进一步的封装
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
%matplotlib inline
dpath = './data/'
data = pd.read_csv(dpath +"boston_housing.csv")
data.head()
data.info()
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM 506 non-null float64
ZN 506 non-null int64
INDUS 506 non-null float64
CHAS 506 non-null int64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null int64
TAX 506 non-null int64
PTRATIO 506 non-null int64
B 506 non-null float64
LSTAT 506 non-null float64
MEDV 506 non-null float64
dtypes: float64(9), int64(5)
memory usage: 55.4 KB
labels : single label or list-like
axis : int or axis name
level : int or level name, default None For MultiIndex
inplace : bool, default False. If True, do operation inplace and return None.
errors : {‘ignore’, ‘raise’}, default ‘raise’,If‘ignore, suppress error and existing labels are dropped.
Returns: dropped : type of caller
y = data['MEDV'] # 获取列名为'MEDV'的列的数据
#print y
X = data.drop('MEDV', axis=1) # 从axis=1轴(列)中删去列名为'MEDV'的列
X.info()
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM 506 non-null float64
ZN 506 non-null int64
INDUS 506 non-null float64
CHAS 506 non-null int64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null int64
TAX 506 non-null int64
PTRATIO 506 non-null int64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(8), int64(5)
memory usage: 51.5 KB
*arrays : sequence of indexables with same length / shape[0]
Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
test_size : float, int, or None (default is None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the
test split. If int, represents the absolute number of test samples. If None, the value is automatically
set to the complement of the train size. If train size is also None, test size is set to 0.25.
train_size : float, int, or None (default is None)。
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include
in the train split. If int, represents the absolute number of train samples. If None, the value is
automatically set to the complement of the test size.
random_state : int or RandomState。Pseudo-random number generator state used for random sampling.
stratify : array-like or None (default is None)
X:输入特征,
y:输入标签,
random_state:随机种子,
test_size:测试样本数占比,为默认为0.25
[X_train, y_train] 和 [X_test, y_test]是一对,分别对应分割之后的训练数据和训练标签,测试数据和训练标签
from sklearn.cross_validation import train_test_split
# 随机采样25%的数据构建测试样本,其余作为训练样本
# X:输入特征,y:输入标签,random_state随机种子为27, test_size:测试样本数占比,如果train_size=NULL,则为默认的0.25
# 输出为训练样本和测试样本的DataFrame数据
X_train,X_test,y_train, y_test = train_test_split(X, y, random_state=27, test_size=0.25)
print X_train.shape
print y_train.shape
print X_test.shape
print y_test.shape
(379, 13)
(379L,)
(127, 13)
(127L,)
数据标准化:
with_mean : boolean, True by default.If True, center the data before scaling.
with_std : boolean, True by default.If True, scale the data to unit variance (or equivalently, unit
standard deviation).
copy : boolean, optional, default True.If False, try to avoid a copy and do inplace scaling instead.
X : numpy array of shape [n_samples, n_features].Training set.
y : numpy array of shape [n_samples].Target values.
X_new : numpy array of shape [n_samples, n_features_new].Transformed array.
X : array-like, shape [n_samples, n_features].The data used to scale along the features axis.
X_new : numpy array of shape [n_samples, n_features_new].Transformed array.
# 数据标准化
from sklearn.preprocessing import StandardScaler
# 分别初始化对特征和目标值的标准化器
ss_X = StandardScaler()
ss_y = StandardScaler()
# 分别对训练和测试数据的特征以及目标值进行标准化处理
X_train = ss_X.fit_transform(X_train) # 先计算均值和方差,再进行变换
X_test = ss_X.transform(X_test) # 利用上面计算好的均值和方差,直接进行转换
y_train = ss_y.fit_transform(y_train)
y_test = ss_y.transform(y_test)
print X_train
[[-0.37683627 -0.50304409 2.48277286 ..., 0.86555269 -0.13431739
1.60921499]
[ 5.13573477 -0.50304409 1.0607873 ..., 0.86555269 -2.93693892
3.44576006]
[-0.37346431 0.01751212 -0.44822848 ..., -1.31269744 0.33223834
2.45055308]
...,
[-0.39101613 -0.50304409 -1.13119458 ..., -0.87704742 0.28632785
-0.36708256]
[-0.38897021 -0.50304409 -1.2462515 ..., -0.44139739 0.38012111
0.19898553]
[-0.31120842 -0.50304409 -0.40840109 ..., 1.30120272 0.37957325
-0.18215757]]
转自:http://www.cnblogs.com/tan-v