Official documentation: http://lijiancheng0614.github.io/scikit-learn/modules/generated/sklearn.preprocessing.StandardScaler.html
Github: https://github.com/scikit-learn/scikit-learn.git
LearnCode/scikit-learn/sklearn/preprocessing/data.py
class StandardScaler
Description:
class StandardScaler(BaseEstimator, TransformerMixin):
// Standardizes features by removing the mean and scaling to unit variance.
"""Standardize features by removing the mean and scaling to unit variance
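// Centering and scaling are applied to each feature independently, using statistics computed from the training samples.
// The mean and standard deviation are then stored and reused when later data is transformed.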
Centering and scaling happen independently on each feature by computing
the relevant statistics on the samples in the training set. Mean and
standard deviation are then stored to be used on later data using the
`transform` method.
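// Annotator's example: a minimal sketch of the fit/transform behaviour described above, assuming a recent scikit-learn and NumPy are installed. The toy arrays and the `manual_*` names are made up for illustration. It checks that the stored `mean_` and `scale_` match the per-column mean and population standard deviation, and that new data is transformed with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training set (illustrative): 3 samples, 2 features on different scales.
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])

scaler = StandardScaler()
scaler.fit(X_train)  # computes and stores the per-feature statistics

# The same statistics computed by hand, column by column.
# np.std defaults to ddof=0, i.e. the population standard deviation.
manual_mean = X_train.mean(axis=0)
manual_std = X_train.std(axis=0)

# Later data is transformed with the statistics learned from X_train,
# not with statistics of the new data itself.
X_new = np.array([[2.0, 400.0]])
X_new_scaled = scaler.transform(X_new)
manual_scaled = (X_new - manual_mean) / manual_std

print(scaler.mean_, scaler.scale_)
print(X_new_scaled)
```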
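// Many machine learning estimators require a standardized dataset:
// they may perform poorly if individual features do not look roughly like standard normally distributed data
// (i.e. Gaussian with zero mean and unit variance).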
Standardization of a dataset is a common requirement for many
machine learning estimators: they might behave badly if the
individual features do not more or less look like standard normally
distributed data (e.g. Gaussian with 0 mean and unit variance).
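// RBF: Radial Basis Function
// For example, many elements used in the objective function of a learning algorithm
// (such as the RBF kernel of support vector machines, or the L1 and L2 regularizers of linear models)
// assume that all features are centered around 0 and have variances of the same order of magnitude.
// If one feature's variance is orders of magnitude larger than the others', it can dominate the objective function and keep the estimator from learning correctly from the other features.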
For instance many elements used in the objective function of
a learning algorithm (such as the RBF kernel of Support Vector
Machines or the L1 and L2 regularizers of linear models) assume that
all features are centered around 0 and have variance in the same
order. If a feature has a variance that is orders of magnitude larger
than others, it might dominate the objective function and make the
estimator unable to learn from other features correctly as expected.
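// Annotator's example: a rough sketch of why a large-variance feature dominates, using only the Python standard library. The data and the helper `share_of_feature` are hypothetical. It measures how much of the squared Euclidean distance between two samples (the quantity an RBF kernel depends on) comes from the second feature, before and after standardization.

```python
import statistics

# Toy data (illustrative): feature 0 is small-scale, feature 1 is large-scale.
data = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]


def standardize(rows):
    """Standardize each column with its mean and population std (ddof=0)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def share_of_feature(a, b, index):
    """Fraction of the squared distance between a and b due to one feature."""
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared[index] / sum(squared)


raw_share = share_of_feature(data[0], data[1], index=1)

scaled = standardize(data)
scaled_share = share_of_feature(scaled[0], scaled[1], index=1)

print(raw_share)     # very close to 1: feature 1 decides the distance alone
print(scaled_share)  # noticeably smaller: both features now contribute
```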
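// This scaler can also be applied to sparse matrices (CSC: compressed sparse column, CSR: compressed sparse row)
// by setting `with_mean=False`, which avoids destroying the sparsity structure of the data.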
This scaler can also be applied to sparse CSR or CSC matrices by passing
`with_mean=False` to avoid breaking the sparsity structure of the data.
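// Annotator's example: a small sketch of scaling a sparse matrix, assuming scikit-learn, SciPy and NumPy are available. The matrix is made up. With `with_mean=False` only the per-column scaling is applied, so zero entries stay zero and the result is still sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Toy CSR matrix (illustrative) with one non-zero entry per row.
X = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [0.0, 0.0, 4.0],
                                [6.0, 0.0, 0.0]]))

# No centering: subtracting the mean would turn the zeros into non-zeros.
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X)

print(type(X_scaled))
print(X_scaled.toarray())
```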
Read more in the :ref:`User Guide
Parameters
----------
with_mean : boolean, True by default
If True, center the data before scaling.
This does not work (and will raise an exception) when attempted on
sparse matrices, because centering them entails building a dense
matrix which in common use cases is likely to be too large to fit in
memory.
with_std : boolean, True by default
If True, scale the data to unit variance (or equivalently,
unit standard deviation).
copy : boolean, optional, default True
If False, try to avoid a copy and do inplace scaling instead.
This is not guaranteed to always work inplace; e.g. if the data is
not a NumPy array or scipy.sparse CSR matrix, a copy may still be
returned.
Attributes
----------
scale_ : ndarray, shape (n_features,)
Per feature relative scaling of the data.
.. versionadded:: 0.17
*scale_* is recommended instead of deprecated *std_*.
mean_ : array of floats with shape [n_features]
The mean value for each feature in the training set.
var_ : array of floats with shape [n_features]
The variance for each feature in the training set. Used to compute
`scale_`
n_samples_seen_ : int
The number of samples processed by the estimator. Will be reset on
new calls to fit, but increments across ``partial_fit`` calls.
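// Annotator's example: a sketch of how `n_samples_seen_` behaves, assuming a scikit-learn version that provides `partial_fit`. The random data and the variable names are illustrative. Fitting in batches accumulates the count and yields the same statistics as a single `fit`, while a new call to `fit` resets the count.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Feed the data in 4 batches; the statistics are updated incrementally.
incremental = StandardScaler()
for batch in np.array_split(X, 4):
    incremental.partial_fit(batch)

seen_after_batches = int(incremental.n_samples_seen_)
incremental_mean = incremental.mean_.copy()
incremental_var = incremental.var_.copy()

# Reference: a single fit on the full data set.
full = StandardScaler().fit(X)

# Calling fit again discards the accumulated state and starts over.
incremental.fit(X[:10])
seen_after_refit = int(incremental.n_samples_seen_)

print(seen_after_batches, seen_after_refit)
```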
See also
--------
:func:`sklearn.preprocessing.scale` to perform centering and
scaling without using the ``Transformer`` object oriented API
:class:`sklearn.decomposition.RandomizedPCA` with `whiten=True`
to further remove the linear correlation across features.
"""
def __init__(self, copy=True, with_mean=True, with_std=True):
self.with_mean = with_mean
self.with_std = with_std
self.copy = copy