sklearn.preprocessing()详解: 标准化、正则化、最小最大规范化、特征二值化

一. 数据的标准化与归一化(zero-mean normalization): class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)

官方文档-StandardScaler

  • standard score(z) of a sample x: z = (x - u) / s
    u: the mean of training samples (u = 0 if with_mean = False)
    s: the standard deviation of the training samples (s = 1 if with_std = False)
    sklearn.preprocessing()详解: 标准化、正则化、最小最大规范化、特征二值化_第1张图片

  • Parameters and Attributes:
    sklearn.preprocessing()详解: 标准化、正则化、最小最大规范化、特征二值化_第2张图片
    例子:

from sklearn.preprocessing import StandardScaler
data = [[0, 0], [0, 0], [1, 1], [1, 1]]
scaler = StandardScaler()
print(scaler.fit(data))

output: StandardScaler()

print(scaler.mean_)
print(scaler.var_)

output:
array([0.5, 0.5])
array([0.25, 0.25])

其中scaler.fit(data),即StandardScaler.fit(data)计算出数据的平均值和标准差,并存储在StandardScaler()中便于之后的使用;
调用attributes中的mean_和var_求数据的平均值和方差.
除了fit()之外,StandardScaler()还有许多不同的methods:

  • Popular Methods:
  1. fit(): compute the mean and std to be used for later scaling
  2. fit_transform(): fit to data, then transform it. 即计算出mean和std,并对数据作转换,从而变成标准的正态分布
  3. transform(): perform standardization by centering and scaling. 即只进行转换成标准正态分布的操作

注意:

  1. 一般来说先使用fit,再使用transform
  2. 将训练集和测试集放在一起做标准化;或者在训练集上作标准化后,用相同的标准化器(scaler())去标准化测试集 ,必须用同一个scaler进行transform

例:

data = [[0, 0], [0, 0], [1, 1], [1, 1]]
# 基于mean和std的标准化
scaler = preprocessing.StandardScaler().fit(train_data)
scaler.transform(train_data)
scaler.transform(test_data)

二. 最小最大规范化(min-max normalization) class sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), *, copy=True)

官方文档-MinMaxScaler

  • min-max normalization: transform features by scaling each feature to a given range.

sklearn.preprocessing()详解: 标准化、正则化、最小最大规范化、特征二值化_第3张图片

  • The transform is given by:
    X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    X_scaled = X_std * (max - min) + min

    where min, max = feature_range. (default: max = 1, min = 0)

  • Parameters and Attributes:
    sklearn.preprocessing()详解: 标准化、正则化、最小最大规范化、特征二值化_第4张图片

  • Methods:

  1. fit(): compute the min and max for later scaling
  2. transform(): scale features according to feature_range
  3. fit_transform(): fit to data, then transform it.

例:

  1. 基于所给的数据创建最小最大转换器
from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data))

output:
MinMaxScaler()

  1. 根据attribute中data_max_求出转换前数据的最大值
print(scaler.data_max_)

output: [ 1. 18.]

  1. 对数据集进行最小最大规范化
print(scaler.transform(data))

output:
[[0. 0. ]
[0.25 0.25]
[0.5 0.5 ]
[1. 1. ]]


三. 正则化/归一化(normalization)

  • normalization: Scale input vectors individually to unit norm (vector length)
    首先求出样本的p范数,然后该样本的所有元素都要除以该范数,这样最终使得每个样本的范数都是1。规范化(Normalization)是将不同变化范围的值映射到相同的固定范围,常见的是[0,1],也成为归一化。
  • 函数:sklearn.preprocessing.normalize(X, norm=‘l2’, , axis=1, copy=True, return_norm=False)
  • 类:sklearn.preprocessing.Normalizer(norm=‘l2’, , copy=True)
  1. sklearn.preprocessing.normalize(X, norm=‘l2’, , axis=1, copy=True, return_norm=False)
  • Parameters and Returns:
    sklearn.preprocessing()详解: 标准化、正则化、最小最大规范化、特征二值化_第5张图片

例:

import sklearn.preprocessing
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = sklearn.preprocessing.normalize(X, norm='l2')
print(X_normalized)

output:
[[ 0.40824829 -0.40824829 0.81649658]
[ 1. 0. 0. ]
[ 0. 0.70710678 -0.70710678]]

  1. sklearn.preprocessing.Normalizer(norm=‘l2’, , copy=True)
  • Parameters: sklearn.preprocessing()详解: 标准化、正则化、最小最大规范化、特征二值化_第6张图片

例:

from sklearn.preprocessing import Normalizer
normalizer = Normalizer().fit(X)#fit method is useless in this case
print(normalizer)

output: Normalizer()

print(normalizer.transform(X))

output:
array([[ 0.40824829, -0.40824829, 0.81649658],
[ 1. , 0. , 0. ],
[ 0. , 0.70710678, -0.70710678]])

和直接利用函数normalize结果相同。


四. 特征二值化(Binarization)class sklearn.preprocessing.Binarizer(*, threshold=0.0, copy=True)

Binarize data (set feature values to 0 or 1) according to a threshold
Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

  • Parameters:
    sklearn.preprocessing()详解: 标准化、正则化、最小最大规范化、特征二值化_第7张图片

例子:

from sklearn.preprocessing import Binarizer
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
transformer = Binarizer().fit(X) #fit does nothing
print(transformer)

output: Binarizer()

transformer.transform(X)

output:
array([[1., 0., 1.],
[1., 0., 0.],
[0., 1., 0.]])


参考:
几种常用的数据标准化方法

有关transform()和fit_transform

你可能感兴趣的:(Sklearn基础)