sklearn (1): Data preprocessing

1 Numerical feature processing

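1.1 Mean removal

For each feature, the values across all samples are shifted and scaled so that the feature ends up with mean 0 and standard deviation 1.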
import numpy as np
from sklearn import preprocessing

# each column is a sample, and each row is a feature,
# i.e., there are 4 samples here, each with 3 features.
data = np.array([[3, -1.5, 2, -5.4],
                [0, 4, -0.3, 2.1],
                [1, 3.3, -1.9, -4.3]])

# mean removal
data_standardized = preprocessing.scale(data, axis=1)
print(data_standardized)
print("\nMean = ", data_standardized.mean(axis=1))
print("Std deviation", data_standardized.std(axis=1))

Output:

[[ 1.05366545 -0.31079341  0.75045237 -1.49332442]
 [-0.8340361   1.46675314 -1.00659529  0.37387825]
 [ 0.51284962  1.31254733 -0.49546489 -1.32993207]]

Mean =  [ -5.55111512e-17  -1.11022302e-16   0.00000000e+00]
Std deviation [ 1.  1.  1.]
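
The same numbers can be reproduced in plain NumPy, which makes the formula explicit: subtract each row's mean and divide by each row's population standard deviation. This is only a minimal sketch for checking the output above; the names row_mean, row_std and manual_standardized are introduced here for illustration.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# per-row mean and population standard deviation (ddof=0 is NumPy's default)
row_mean = data.mean(axis=1, keepdims=True)
row_std = data.std(axis=1, keepdims=True)

# standardize every row: zero mean, unit standard deviation
manual_standardized = (data - row_mean) / row_std
print(manual_standardized)
```

The printed matrix should agree with the output of preprocessing.scale(data, axis=1) shown above.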

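1.2 Range scaling (min-max scaling)

Each value has the minimum subtracted and is then divided by (maximum - minimum), so the original maximum maps to 1 and the original minimum maps to 0. Note that MinMaxScaler always works column by column (it treats each column as a feature), unlike the axis=1 call in 1.1, so here each column of data is scaled independently.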

# scaling
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)
print(data_scaled)  

Output:

[[ 1.          0.          1.          0.        ]
 [ 0.          1.          0.41025641  1.        ]
 [ 0.33333333  0.87272727  0.          0.14666667]]
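
The min-max formula can be checked by hand with NumPy. The sketch below applies (x - min) / (max - min) to every column, which is the direction MinMaxScaler works in; col_min, col_max and manual_scaled are illustrative names, not part of sklearn.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# minimum and maximum of every column
col_min = data.min(axis=0)
col_max = data.max(axis=0)

# map each column linearly onto the range [0, 1]
manual_scaled = (data - col_min) / (col_max - col_min)
print(manual_scaled)
```

Every column of the result contains exactly one 0 (the old minimum) and one 1 (the old maximum), matching the output above.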

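1.3 Normalization

Normalization keeps the sign and the relative proportions of the values unchanged while rescaling each vector (here each row, since axis=1) so that its norm equals 1.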

data_normalized_l1 = preprocessing.normalize(data, norm='l1', axis=1)
data_normalized_l2 = preprocessing.normalize(data, norm='l2', axis=1)
print("L1 norm")
print(data_normalized_l1)
print("\n L2 norm")
print(data_normalized_l2)

Output:

L1 norm
[[ 0.25210084 -0.12605042  0.16806723 -0.45378151]
 [ 0.          0.625      -0.046875    0.328125  ]
 [ 0.0952381   0.31428571 -0.18095238 -0.40952381]]

 L2 norm
[[ 0.45017448 -0.22508724  0.30011632 -0.81031406]
 [ 0.          0.88345221 -0.06625892  0.46381241]
 [ 0.17152381  0.56602858 -0.32589524 -0.73755239]]
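
To confirm that every row really ends up with norm 1, the two normalizations can be written out directly in NumPy. This is a sketch for verification only; manual_l1 and manual_l2 are names chosen here.

```python
import numpy as np

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# L1: divide each row by the sum of its absolute values
manual_l1 = data / np.abs(data).sum(axis=1, keepdims=True)

# L2: divide each row by its Euclidean length
manual_l2 = data / np.sqrt((data ** 2).sum(axis=1, keepdims=True))

# after normalization every row has norm 1 in the corresponding sense
print(np.abs(manual_l1).sum(axis=1))
print(np.linalg.norm(manual_l2, axis=1))
```

Because each row is only divided by a positive constant, signs and ratios between the entries of a row are unchanged.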

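1.4 Binarization

Binarization converts a numerical feature vector into boolean (0/1) values: values greater than the threshold become 1, all others become 0.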

data_binarized = preprocessing.Binarizer(threshold=0.4).transform(data)
print("\nBinarized data:")
print(data_binarized)

Output:

Binarized data:
[[ 1.  0.  1.  0.]
 [ 0.  1.  0.  1.]
 [ 1.  1.  0.  0.]]

NumPy supports elementwise comparison directly, so the following one-liner gives the same 0/1 values (as integers rather than floats):

(data > 0.4).astype(np.int32)
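
Binarizer maps values strictly greater than the threshold to 1, so the matching NumPy comparison is data > threshold. The sketch below checks that the two approaches agree and shows what happens to a value exactly on the threshold; the variable names and the small boundary array are made up for this check.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])
threshold = 0.4

# fit() learns nothing for Binarizer, so fit_transform only applies the threshold
sk_result = Binarizer(threshold=threshold).fit_transform(data)

# strictly greater than the threshold -> 1, everything else -> 0
np_result = (data > threshold).astype(np.float64)
print(np.array_equal(sk_result, np_result))

# a value exactly equal to the threshold is mapped to 0
boundary = Binarizer(threshold=threshold).fit_transform(np.array([[0.4, 0.5]]))
print(boundary)
```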

2 Encoding non-numerical data

2.1 Label encoding

Encode the distinct strings as the integers 0 to n-1.

label_encoder = preprocessing.LabelEncoder()
input_classes = ['audi', 'ford', 'toyota', 'ford', 'bwm']
label_encoder.fit(input_classes)
print("\nClass mapping:")
for i, item in enumerate(label_encoder.classes_):
    print(item, "-->", i)

The resulting mapping:

Class mapping:
audi --> 0
bwm --> 1
ford --> 2
toyota --> 3
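
The mapping printed above is simply the sorted distinct labels numbered in order. As a rough illustration (not a description of sklearn's internals), the same table can be produced with np.unique; classes and codes are names used only in this sketch.

```python
import numpy as np

input_classes = ['audi', 'ford', 'toyota', 'ford', 'bwm']

# np.unique returns the sorted distinct labels; return_inverse gives each
# input's position in that sorted list, i.e. its integer code
classes, codes = np.unique(input_classes, return_inverse=True)

for i, item in enumerate(classes):
    print(item, "-->", i)
print(codes)
```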

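Encoding new data (use transform here, not fit_transform, so that the mapping learned above is reused instead of being refit on the new labels):

labels = ['toyota', 'ford', 'audi']
encoded_labels = label_encoder.transform(labels)
print("Labels: ", labels)
print("Encoded Labels: ", encoded_labels)

Encoded result:

Labels:  ['toyota', 'ford', 'audi']
Encoded Labels:  [3 2 0]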

To decode, use inverse_transform:

encoded_labels = [2, 1, 0, 3, 1]
decoded_labels = label_encoder.inverse_transform(encoded_labels)
print("Encoded Labels: ", encoded_labels)
print("Decoded Labels: ", decoded_labels)

Decoded result:

Encoded Labels:  [2, 1, 0, 3, 1]
Decoded Labels:  ['ford' 'bwm' 'audi' 'toyota' 'bwm']
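
Two properties of a fitted LabelEncoder are worth checking: transform followed by inverse_transform returns the original labels, and a label that was not present during fit is rejected with a ValueError. The sketch below assumes the same training list as above; 'honda' is a made-up unseen label.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(['audi', 'ford', 'toyota', 'ford', 'bwm'])

# encode with the fitted mapping, then decode again
codes = label_encoder.transform(['toyota', 'ford', 'audi'])
round_trip = [str(label) for label in label_encoder.inverse_transform(codes)]
print(codes, round_trip)

# a label the encoder never saw during fit cannot be encoded
try:
    label_encoder.transform(['honda'])
    unseen_rejected = False
except ValueError as err:
    unseen_rejected = True
    print("Unseen label:", err)
```

If new categories can appear at prediction time, they have to be handled before calling transform, for example by refitting on the full label set.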

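2.2 One-hot encoding

One-hot encoding is used for categorical data that has no inherent order. Each category becomes its own 0/1 column, so any two distinct categories are the same distance apart in ordinary Euclidean space. Note that OneHotEncoder encodes each column separately.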

data = np.array([[0, 2, 1, 12],
                 [1, 3, 5, 3],
                 [2, 3, 2, 12],
                 [1, 2, 4, 3]])
encoder = preprocessing.OneHotEncoder()
encoder.fit(data)
encoder_vector = encoder.transform([[2, 3, 5, 3]]).toarray()
print(encoder_vector)

Output:

[[ 0.  0.  1.  0.  1.  0.  0.  0.  1.  1.  0.]]
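
The 11 output columns come from the number of distinct values in each input column: 3 + 2 + 4 + 2. The NumPy sketch below rebuilds the encoded vector by hand to show that layout; it is an illustration of the column-wise idea, not sklearn's implementation, and sample, pieces and manual_vector are names introduced here.

```python
import numpy as np

data = np.array([[0, 2, 1, 12],
                 [1, 3, 5, 3],
                 [2, 3, 2, 12],
                 [1, 2, 4, 3]])
sample = [2, 3, 5, 3]

pieces = []
for col, value in zip(data.T, sample):
    categories = np.unique(col)  # sorted distinct values seen in this column
    # one slot per category, 1.0 where the sample's value matches
    pieces.append((categories == value).astype(float))

# concatenate the per-column blocks into one encoded vector
manual_vector = np.concatenate(pieces)
print([len(p) for p in pieces])
print(manual_vector)
```

The block sizes are 3, 2, 4 and 2, and within each block exactly one entry is 1.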
