Data Preprocessing for Machine Learning
(mean removal, range scaling, normalization, binarization, one-hot encoding)
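Reference: "Data Preprocessing for Machine Learning (Mean Removal, Range Scaling, Normalization, Binarization, One-Hot Encoding, Label Encoding)" by 酱紫煲饭~, CSDN blog, blog.csdn.net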
Example
import numpy as np
from sklearn import preprocessing
Data:
data = np.array([[ 3, -1.5,  2,   -5.4],
                 [ 0,  4,   -0.3,  2.1],
                 [ 1,  3.3, -1.9, -4.3]])
1. Mean removal (standardization)
Mean = [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Std deviation = [ 1. 1. 1. 1.]
(each feature column is shifted to mean 0 and scaled to standard deviation 1; the printed means are zero up to floating-point error)
# mean removal
data_standardized = preprocessing.scale(data)
print("\nMean =", data_standardized.mean(axis=0))
print("Std deviation =", data_standardized.std(axis=0))
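To make the arithmetic behind the standardization step concrete, here is a minimal NumPy-only sketch. It assumes the same `data` array as the example and uses the population standard deviation (NumPy's default, `ddof=0`); the name `manual_standardized` is illustrative and not part of the original post.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Per-column mean and population standard deviation (ddof=0).
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)

# Shift every column to mean 0, then scale it to standard deviation 1.
manual_standardized = (data - col_mean) / col_std

print("Mean =", manual_standardized.mean(axis=0))
print("Std deviation =", manual_standardized.std(axis=0))
```

The means come out as values very close to zero rather than exactly zero, which is ordinary floating-point rounding. The sketch does not guard against a constant column, where the standard deviation would be 0.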
2. Range scaling (Scaling)
Min max scaled data:
[[ 1.          0.          1.          0.        ]
 [ 0.          1.          0.41025641  1.        ]
 [ 0.33333333  0.87272727  0.          0.14666667]]
(each feature is rescaled to the range 0 to 1)
# min max scaling
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)
print("\nMin max scaled data:\n", data_scaled)
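The scaling above can be sketched by hand as (x - min) / (max - min), applied column by column. This is a minimal NumPy-only illustration for the target range (0, 1), assuming the same `data` array; `manual_scaled` is an illustrative name, and the sketch skips the handling of constant columns, where max equals min.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Smallest and largest value of each feature (column).
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# Map each column linearly so its minimum becomes 0 and its maximum becomes 1.
manual_scaled = (data - col_min) / (col_max - col_min)

print("Min max scaled data:\n", manual_scaled)
```

For example, the third column spans -1.9 to 2, so the value -0.3 maps to (-0.3 + 1.9) / 3.9, roughly 0.41, in line with the printed result above.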
3. Normalization
L1 normalized data:
[[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
 [ 0.          0.625      -0.046875    0.328125  ]
 [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]
(the absolute values of each sample's features sum to 1)
# normalization
data_normalized = preprocessing.normalize(data, norm='l1')
print("\nL1 normalized data:\n", data_normalized)
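Unlike the two previous steps, L1 normalization works per sample (row), not per feature (column). A minimal NumPy-only sketch of that computation, assuming the same `data` array, might look like this; `row_l1` and `manual_l1` are illustrative names.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# L1 norm of each row: the sum of the absolute values of its features.
# keepdims=True keeps the shape (3, 1) so the division broadcasts per row.
row_l1 = np.abs(data).sum(axis=1, keepdims=True)

# Divide every row by its own L1 norm; signs are preserved.
manual_l1 = data / row_l1

print("L1 normalized data:\n", manual_l1)
print("Row sums of absolute values:", np.abs(manual_l1).sum(axis=1))
```

The first row has L1 norm 3 + 1.5 + 2 + 5.4 = 11.9, so its first entry becomes 3 / 11.9, roughly 0.2521. The sketch assumes no row is entirely zero.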
4. Binarization
Binarized data:
[[ 1.  0.  1.  0.]
 [ 0.  1.  0.  1.]
 [ 0.  1.  0.  0.]]
(given a threshold, values above it become 1 and all other values become 0)
# binarization
data_binarized = preprocessing.Binarizer(threshold=1.4).transform(data)
print("\nBinarized data:\n", data_binarized)
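Binarization with a threshold reduces to a single elementwise comparison. The sketch below is a NumPy-only illustration with the same `data` array and the threshold 1.4 from the example; `manual_binarized` is an illustrative name.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

threshold = 1.4

# Strictly greater than the threshold -> 1.0; everything else -> 0.0.
manual_binarized = (data > threshold).astype(float)

print("Binarized data:\n", manual_binarized)
```

The comparison is strict, so a value exactly equal to the threshold maps to 0.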
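5. One-hot encoding
Original matrix (used to fit the encoder):
[[0, 2, 1, 12],
 [1, 3, 5, 3],
 [2, 3, 2, 12],
 [1, 2, 4, 3]]
Result (encoding of the sample [2, 3, 5, 3]):
[[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
(for the encoding process, see "One-Hot Encoding Explained" (机器学习之One-Hot Encoding详解), www.jianshu.com)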
# one hot encoding
encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]])
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print ("\nEncoded vector:\n", encoded_vector)
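To show where the 11 positions of the encoded vector come from, here is a NumPy-only sketch of the encoding process. It assumes each column's categories are the sorted distinct values seen in the fitting matrix; the names `train`, `sample`, `categories`, and `manual_encoded` are illustrative and not part of the original post.

```python
import numpy as np

# Matrix used to learn the categories of each column.
train = np.array([[0, 2, 1, 12],
                  [1, 3, 5, 3],
                  [2, 3, 2, 12],
                  [1, 2, 4, 3]])

# Sample to encode.
sample = np.array([2, 3, 5, 3])

# Sorted distinct values per column:
# column 0 -> [0, 1, 2], column 1 -> [2, 3],
# column 2 -> [1, 2, 4, 5], column 3 -> [3, 12].
categories = [np.unique(train[:, j]) for j in range(train.shape[1])]

# One block per column: 1.0 at the position of the sample's value, 0.0 elsewhere.
blocks = [(cats == value).astype(float)
          for cats, value in zip(categories, sample)]

# Concatenate the blocks: 3 + 2 + 4 + 2 = 11 positions.
manual_encoded = np.concatenate(blocks)

print("Encoded vector:\n", manual_encoded)
```

Reading the result block by block: the value 2 in the first column gives 0 0 1, the value 3 in the second gives 0 1, the value 5 in the third gives 0 0 0 1, and the value 3 in the fourth gives 1 0. A value that never appeared in the fitting matrix would produce an all-zero block in this sketch.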
2019/1/5