Python Data Preprocessing Examples: Data Preprocessing from the Python Machine Learning Cookbook

Data preprocessing for machine learning

(mean removal, range scaling, normalization, binarization, one-hot encoding)

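Reference: Data Preprocessing for Machine Learning (mean removal, range scaling, normalization, binarization, one-hot encoding, label encoding), by 酱紫煲饭~, CSDN blog (blog.csdn.net)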

Example

import numpy as np
from sklearn import preprocessing

Data (3 samples as rows, 4 features as columns):

data = np.array([[ 3, -1.5,    2, -5.4],
                 [ 0,    4, -0.3,  2.1],
                 [ 1,  3.3, -1.9, -4.3]])

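1. Mean removal (standardization)

Output:

Mean = [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Std deviation = [ 1. 1. 1. 1.]

(Each feature column is shifted to zero mean and scaled to unit standard deviation; the means above are zero up to floating-point rounding.)

Code:

# mean removal
data_standardized = preprocessing.scale(data)
print("\nMean =", data_standardized.mean(axis=0))
print("Std deviation =", data_standardized.std(axis=0))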
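The standardization step can also be written out by hand. The following is a minimal NumPy sketch, assuming the default behaviour of `preprocessing.scale` (column-wise, using the population standard deviation); the names `col_mean`, `col_std` and `manual_standardized` are introduced here for illustration only.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Column-wise mean and population standard deviation (ddof=0)
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)

# Subtract the mean and divide by the standard deviation, feature by feature
manual_standardized = (data - col_mean) / col_std

print("Mean =", manual_standardized.mean(axis=0))
print("Std deviation =", manual_standardized.std(axis=0))
```

The printed means should again be zero up to rounding error, and the standard deviations should all be 1.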

2. Range scaling (min-max scaling)

Output:

Min max scaled data:
[[ 1.          0.          1.          0.        ]
 [ 0.          1.          0.41025641  1.        ]
 [ 0.33333333  0.87272727  0.          0.14666667]]

(Each feature column is rescaled to the range 0 to 1.)

Code:

# min max scaling
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)
print("\nMin max scaled data:\n", data_scaled)
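To make the scaling explicit, here is a small NumPy sketch of the min-max formula for the target range (0, 1): each value has its column minimum subtracted and is then divided by the column range. The variable names are illustrative, and the sketch assumes no column is constant (otherwise the division would be by zero).

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Column-wise minimum and maximum
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# (x - min) / (max - min), applied per feature column
manual_scaled = (data - col_min) / (col_max - col_min)

print("Min max scaled data:\n", manual_scaled)
```

For example, the third column has minimum -1.9 and maximum 2, so the value -0.3 maps to 1.6 / 3.9, which is the 0.41025641 shown in the output above.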

3. Normalization

Output:

L1 normalized data:
[[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
 [ 0.          0.625      -0.046875    0.328125  ]
 [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]

(Each sample, i.e. each row, is rescaled so that the absolute values of its features sum to 1.)

Code:

# normalization
data_normalized = preprocessing.normalize(data, norm='l1')
print("\nL1 normalized data:\n", data_normalized)
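Unlike the two previous steps, L1 normalization works row by row rather than column by column. A minimal NumPy sketch of the same computation follows; `row_l1` and `manual_l1` are illustrative names, and the sketch assumes no row consists entirely of zeros.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# L1 norm of each row: the sum of absolute values, kept as a column vector
row_l1 = np.abs(data).sum(axis=1, keepdims=True)

# Divide every row by its own L1 norm
manual_l1 = data / row_l1

print("L1 normalized data:\n", manual_l1)
print("Row sums of absolute values:", np.abs(manual_l1).sum(axis=1))
```

The second row, for instance, has an L1 norm of 0 + 4 + 0.3 + 2.1 = 6.4, so its entry 4 becomes 0.625.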

4. Binarization

Output:

Binarized data:
[[ 1. 0. 1. 0.]
 [ 0. 1. 0. 1.]
 [ 0. 1. 0. 0.]]

(A threshold is chosen; values above the threshold become 1, and values at or below it become 0.)

Code:

# binarization
data_binarized = preprocessing.Binarizer(threshold=1.4).transform(data)
print("\nBinarized data:\n", data_binarized)
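Binarization amounts to an element-wise comparison against the threshold. As a rough sketch in plain NumPy, using the same threshold of 1.4 (the names `threshold` and `manual_binarized` are illustrative):

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

threshold = 1.4

# True where the value is strictly greater than the threshold, then cast to 0.0/1.0
manual_binarized = (data > threshold).astype(float)

print("Binarized data:\n", manual_binarized)
```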

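5. One-hot encoding

Matrix used to fit the encoder:

[[0, 2, 1, 12],
 [1, 3, 5, 3],
 [2, 3, 2, 12],
 [1, 2, 4, 3]]

Result of encoding the sample [2, 3, 5, 3]:

[[ 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]

(The four columns contain 3, 2, 4 and 2 distinct values respectively, so the encoded vector has 3 + 2 + 4 + 2 = 11 entries. For a walkthrough of the encoding process, see: Detailed Explanation of One-Hot Encoding in Machine Learning, www.jianshu.com)

Code:

# one hot encoding
encoder = preprocessing.OneHotEncoder()
encoder.fit([[0, 2, 1, 12], [1, 3, 5, 3], [2, 3, 2, 12], [1, 2, 4, 3]])
encoded_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print("\nEncoded vector:\n", encoded_vector)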
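To see where the 11 entries of the encoded vector come from, the encoding can be reproduced by hand. The sketch below uses only NumPy and assumes that categories are the sorted distinct values of each column, which matches the result shown for this example; `train`, `sample`, `categories` and `manual_encoded` are names introduced here for illustration.

```python
import numpy as np

train = np.array([[0, 2, 1, 12],
                  [1, 3, 5, 3],
                  [2, 3, 2, 12],
                  [1, 2, 4, 3]])
sample = np.array([2, 3, 5, 3])

# "Fitting": collect the sorted distinct values seen in each column
categories = [np.unique(train[:, j]) for j in range(train.shape[1])]

# "Transforming": one indicator block per column, with a 1 at the matching category
blocks = [(cats == value).astype(float) for cats, value in zip(categories, sample)]
manual_encoded = np.concatenate(blocks)

print([len(c) for c in categories])  # [3, 2, 4, 2]
print(manual_encoded)
```

Reading the result block by block: the value 2 in the first column gives [0, 0, 1], the value 3 in the second gives [0, 1], the value 5 in the third gives [0, 0, 0, 1], and the value 3 in the fourth gives [1, 0].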

2019/1/5
