Introduction to Data Standardization with StandardScaler

Official documentation: http://lijiancheng0614.github.io/scikit-learn/modules/generated/sklearn.preprocessing.StandardScaler.html

GitHub: https://github.com/scikit-learn/scikit-learn.git

LearnCode/scikit-learn/sklearn/preprocessing/data.py

class StandardScaler

 

Description:

class StandardScaler(BaseEstimator, TransformerMixin):

// Standardize features by removing the mean and scaling to unit variance

    """Standardize features by removing the mean and scaling to unit variance

 

// Centering and scaling are performed independently on each feature, using statistics computed from the samples in the training set.

// The mean and standard deviation are then stored and applied to later data via the `transform` method.

    Centering and scaling happen independently on each feature by computing

    the relevant statistics on the samples in the training set. Mean and

    standard deviation are then stored to be used on later data using the

    `transform` method.
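
A minimal usage sketch of this fit/transform workflow (the array and variable names below are illustrative, not from the scikit-learn source): fit() computes and stores the per-feature statistics, and transform() reuses them on later data.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X_train)              # compute and store per-feature statistics

print(scaler.mean_)              # per-feature mean of the training set: [2., 20.]
print(scaler.scale_)             # per-feature standard deviation used for scaling

X_new = np.array([[2.0, 25.0]])
print(scaler.transform(X_new))   # standardize new data with the stored mean_/scale_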

 

// Standardizing a dataset is a common requirement for many machine learning estimators:

// they may behave badly if the individual features do not more or less look like standard normally distributed data

// (e.g. Gaussian with zero mean and unit variance).

    Standardization of a dataset is a common requirement for many

    machine learning estimators: they might behave badly if the

    individual features do not more or less look like standard normally

    distributed data (e.g. Gaussian with 0 mean and unit variance).

 

// RBF: Radial Basis Function

// For instance, many elements used in the objective function of a learning algorithm

// (such as the RBF kernel of Support Vector Machines, or the L1 and L2 regularizers of linear models)

// assume that all features are centered around 0 and have variance of the same order.

// If one feature's variance is orders of magnitude larger, it may dominate the objective function and keep the estimator from learning from the other features correctly, as expected.

    For instance many elements used in the objective function of

    a learning algorithm (such as the RBF kernel of Support Vector

    Machines or the L1 and L2 regularizers of linear models) assume that

    all features are centered around 0 and have variance in the same

    order. If a feature has a variance that is orders of magnitude larger

    than others, it might dominate the objective function and make the

    estimator unable to learn from other features correctly as expected.
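
As a sketch of why this matters for an estimator such as an RBF-kernel SVM, the snippet below (synthetic data, purely illustrative) inflates the variance of one feature and uses StandardScaler in a pipeline so that this feature does not dominate the kernel distances.

from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 1000.0                # one feature now has a variance orders of magnitude larger

clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))   # scale features before the RBF kernel
clf.fit(X, y)
print(clf.score(X, y))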

 

// This scaler can also be applied to sparse matrices (CSC: Compressed Sparse Column; CSR: Compressed Sparse Row)

// by passing `with_mean=False`, which avoids breaking the sparsity structure of the data.

    This scaler can also be applied to sparse CSR or CSC matrices by passing

    `with_mean=False` to avoid breaking the sparsity structure of the data.
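
A small sketch of the sparse case (the matrix below is illustrative): with with_mean=False the CSR structure is preserved and only the per-feature scaling is applied.

import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler

X_sparse = sp.csr_matrix([[0.0, 1.0],
                          [0.0, 3.0],
                          [2.0, 0.0]])

scaler = StandardScaler(with_mean=False)   # with_mean=True would raise an error on sparse input
X_scaled = scaler.fit_transform(X_sparse)  # result stays sparse; the sparsity structure is unchanged
print(type(X_scaled))
print(X_scaled.toarray())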

 

    Read more in the :ref:`User Guide <preprocessing_scaler>`.

 

    Parameters

    ----------

    with_mean : boolean, True by default

        If True, center the data before scaling.

        This does not work (and will raise an exception) when attempted on

        sparse matrices, because centering them entails building a dense

        matrix which in common use cases is likely to be too large to fit in

        memory.

 

    with_std : boolean, True by default

        If True, scale the data to unit variance (or equivalently,

        unit standard deviation).

    copy : boolean, optional, default True

        If False, try to avoid a copy and do inplace scaling instead.

        This is not guaranteed to always work inplace; e.g. if the data is

        not a NumPy array or scipy.sparse CSR matrix, a copy may still be

        returned.
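
To illustrate the copy flag (illustrative array; as the docstring notes, in-place operation is attempted but not guaranteed):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_scaled = StandardScaler(copy=False).fit_transform(X)
print(np.shares_memory(X, X_scaled))   # True when scaling could be done in place on X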

 

    Attributes

    ----------

    scale_ : ndarray, shape (n_features,)

        Per feature relative scaling of the data.

 

        .. versionadded:: 0.17

           *scale_* is recommended instead of deprecated *std_*.

 

    mean_ : array of floats with shape [n_features]

        The mean value for each feature in the training set.

 

    var_ : array of floats with shape [n_features]

        The variance for each feature in the training set. Used to compute

        `scale_`

 

    n_samples_seen_ : int

        The number of samples processed by the estimator. Will be reset on

        new calls to fit, but increments across ``partial_fit`` calls.
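
A sketch of incremental fitting (random batches, illustrative only): partial_fit updates the running statistics batch by batch, and n_samples_seen_ accumulates across the calls.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
scaler = StandardScaler()
for batch in np.array_split(rng.rand(100, 3), 4):   # four batches of 25 samples
    scaler.partial_fit(batch)                       # update mean_/var_/scale_ incrementally

print(scaler.n_samples_seen_)                       # 100: accumulates across partial_fit calls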

 

    See also

    --------

    :func:`sklearn.preprocessing.scale` to perform centering and

    scaling without using the ``Transformer`` object oriented API

 

    :class:`sklearn.decomposition.RandomizedPCA` with `whiten=True`

    to further remove the linear correlation across features.

    """

 

    def __init__(self, copy=True, with_mean=True, with_std=True):

        self.with_mean = with_mean

        self.with_std = with_std

        self.copy = copy
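
For comparison with the sklearn.preprocessing.scale function mentioned in the "See also" section, a short sketch (illustrative data): the function standardizes in one shot, while the Transformer keeps the fitted statistics for later data.

import numpy as np
from sklearn.preprocessing import StandardScaler, scale

X = np.array([[1.0, -1.0], [2.0, 0.0], [3.0, 1.0]])
print(scale(X))                            # one-off centering and scaling, nothing is stored
print(StandardScaler().fit_transform(X))   # same values, but mean_/scale_ are kept for transform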

 
