The quoted passages in this post are from: https://sebastianraschka.com/faq/docs/scale-training-test.html
Note: the quoted passages are the original English; the translated parts are based on my own calculations and understanding, so they differ from the original in a few places.
Why do we need to re-use training parameters to transform test data?
Translation: why do we need to derive parameters on the training set and re-use them to scale the test set? This post discusses why, when standardizing with StandardScaler(), we call fit_transform() on the training set (X_train) but only transform() on the test set (X_test).
That is, the following code:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# fit_transform() on the training set: learn the mean/std, then scale
X_train_scaled = scaler.fit_transform(X_train)
# transform() on the test set: re-use the parameters learned from the training set
X_test_scaled = scaler.transform(X_test)
```
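As a runnable sketch of the snippet above (assuming scikit-learn and NumPy are installed; the toy arrays X_train and X_test are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data ("length in cm")
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [6.0], [7.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean_ and scale_ from X_train
X_test_scaled = scaler.transform(X_test)        # re-uses those training parameters

# The scaler stores the TRAINING statistics, not the test set's
print(scaler.mean_)   # [20.]
print(scaler.scale_)  # [8.16496581]
```

Inspecting `scaler.mean_` and `scaler.scale_` after fitting is a quick way to confirm which statistics are being applied to the test set.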
In practice, I’ve seen many ways for scaling a dataset prior to feeding it to a learning algorithm. Can you guess which one is “correct?”
Translation: in practice there are several ways to scale a dataset; of the three below, which one is correct?
Method 1:

```
scaled_dataset = (dataset - dataset_mean) / dataset_std_deviation
train, test = split(scaled_dataset)
```

Method 2:

```
train, test = split(dataset)
scaled_train = (train - train_mean) / train_std_deviation
scaled_test = (test - test_mean) / test_std_deviation
```

Method 3:

```
scaled_train = (train - train_mean) / train_std_deviation
scaled_test = (test - train_mean) / train_std_deviation
```
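A sketch of the three scenarios in NumPy (the data and the fixed split are made up to keep the comparison readable):

```python
import numpy as np

# Made-up single-feature data; the "split" is fixed by hand for clarity
dataset = np.array([10.0, 20.0, 30.0, 5.0, 6.0, 7.0])
train, test = dataset[:3], dataset[3:]

# Method 1: scale first, then split (test statistics leak into the scaling)
scaled_all = (dataset - dataset.mean()) / dataset.std()
m1_test = scaled_all[3:]

# Method 2: split, then scale each part with its OWN mean/std
m2_test = (test - test.mean()) / test.std()

# Method 3 (correct): scale the test set with the TRAINING mean/std
m3_test = (test - train.mean()) / train.std()

print(np.round(m2_test, 2))  # [-1.22  0.    1.22] -- looks like a fresh standard normal
print(np.round(m3_test, 2))  # [-1.84 -1.71 -1.59] -- clearly below everything in train
```

Only method 3 preserves the fact that the new samples are much shorter than anything the model was trained on.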
That’s right, the “correct” way is Scenario 3. I agree, it may look a bit odd to use the training parameters and re-use them to scale the test dataset. (Note that in practice, if the dataset is sufficiently large, we wouldn’t notice any substantial difference between the scenarios 1-3 because we assume that the samples have all been drawn from the same distribution.)
Translation: the correct method is 3: derive the parameters on the training set and re-use them to scale the test set. (Note that in practice, if the dataset is sufficiently large, there is no substantial difference between methods 1-3, because we assume all samples are drawn from the same distribution.)
Again, why Scenario 3? The reason is that we want to pretend that the test data is “new, unseen data.” We use the test dataset to get a good estimate of how our model performs on any new data.
Translation: the reason for method 3 is that the test data plays the role of new, unseen data. We use the test set to estimate how the model performs on any new data.
Now, in a real application, the new, unseen data could be just 1 data point that we want to classify. (How do we estimate mean and standard deviation if we have only 1 data point?) That’s an intuitive case to show why we need to keep and use the training data parameters for scaling the test set.
Translation: the new, unseen data might be just a single data point rather than a batch. (With only one data point, there is no way to estimate a mean and standard deviation.) This is an intuitive illustration of why we need to keep the parameters derived on the training set and use them to scale the test set.
To recapitulate: If we standardize our training dataset, we need to keep the parameters (mean and standard deviation for each feature). Then, we’d use these parameters to transform our test data and any future data later on.
Translation: to recapitulate: if we standardize the training set, we need to keep its parameters (the mean and standard deviation of each feature). We then use these parameters to transform the test set and any future data.
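A minimal sketch of this recipe in plain NumPy (the numbers match the worked example below; the helper name `standardize` is my own):

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0])
# Keep the training parameters once...
train_mean, train_std = train.mean(), train.std()

def standardize(x):
    # ...and re-use them for the test set and for any future data,
    # even a single point that has no mean/std of its own.
    return (x - train_mean) / train_std

print(standardize(np.array([5.0, 6.0, 7.0])))
print(standardize(5.0))  # works for one lone sample, too
```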
Let’s imagine we have a simple training set consisting of 3 samples with 1 feature column (let’s call the feature column “length in cm”):
- sample1: 10 cm -> class 2
- sample2: 20 cm -> class 2
- sample3: 30 cm -> class 1
Translation: suppose the training set consists of only three samples with one feature (length in cm).
Given the data above, we compute the following parameters:
- mean: 20
- standard deviation: 8.2
Translation: from the data above, we compute the following parameters:
If we use these parameters to standardize the same dataset, we get the following values:
- sample1: -1.22 -> class 2
- sample2: 0 -> class 2
- sample3: 1.22 -> class 1
Translation: if we use these parameters to standardize the same training set, we get the following values:
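These numbers are easy to verify (NumPy's default std is the population std, which is also what StandardScaler uses):

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0])
mean, std = train.mean(), train.std()  # population std (ddof=0)
z = (train - mean) / std

print(mean)            # 20.0
print(round(std, 2))   # 8.16 (the text rounds this to 8.2)
print(np.round(z, 2))  # [-1.22  0.    1.22]
```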
Now, let’s say our model has learned the following hypothesis: it classifies samples with a standardized length value < 0.6 as class 2 (class 1 otherwise). So far so good. Now, let’s imagine we have 3 new unlabelled data points that we want to classify.
- sample4: 5 cm -> class ?
- sample5: 6 cm -> class ?
- sample6: 7 cm -> class ?
Translation: the hypothesis: if a new sample's standardized length is less than 0.6, it is classified as class 2 (otherwise class 1). Now there are 3 new data points to classify:
If we look at the unstandardized “length in cm” values in our training dataset, it is intuitive to say that all of these samples likely belong to class 2. However, if we standardize them by re-computing the mean and standard deviation from the new data, we would get values similar to those in the training set (i.e., the properties of a standard normal distribution), and our classifier would (incorrectly) assign the “class 1” label to sample 6:
- sample4: -1.22 -> class 2
- sample5: 0 -> class 2
- sample6: 1.22 -> class 1
Translation: looking at the unstandardized original training set, samples 4, 5 and 6 most likely all belong to class 2 (their lengths are all below sample 1's 10 cm).
But if we standardize samples 4, 5 and 6 with a mean and standard deviation re-computed from the new data themselves, we get values similar to the standardized training set, and the classification changes: sample 6 is wrongly assigned to class 1.
However, if we use the parameters from the training set standardization, we get the following standardized values:
- sample4: -1.84
- sample5: -1.71
- sample6: -1.59
Translation: however, if we re-use the parameters derived on the training set to scale samples 4, 5 and 6, we get the values above (my computed values differ from the original post's, but this does not affect the point; classify them by the rule "standardized length < 0.6 -> class 2, otherwise class 1").
Note that these values are more negative than the value of sample1 in the original training set, which makes much more sense now!
Translation: note that these values are smaller than sample 1's value in the standardized training set, which matches the earlier intuition: looking at the unstandardized original training set, samples 4, 5 and 6 most likely all belong to class 2 (their lengths are all below sample 1's 10 cm).
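Putting the whole example together in NumPy (the decision rule z < 0.6 -> class 2 is the one stated in the text; the helper name `classify` is my own):

```python
import numpy as np

train = np.array([10.0, 20.0, 30.0])  # samples 1-3
new = np.array([5.0, 6.0, 7.0])       # samples 4-6

def classify(z):
    # Decision rule from the text: standardized length < 0.6 -> class 2
    return np.where(z < 0.6, 2, 1)

# Wrong: re-fit the scaling on the new data -- it looks like a fresh
# standard normal again, and sample 6 lands above the 0.6 threshold
z_wrong = (new - new.mean()) / new.std()

# Right: re-use the training parameters -- all three samples stay far
# below anything seen in training
z_right = (new - train.mean()) / train.std()

print(classify(z_wrong))   # [2 2 1] -- sample 6 misclassified
print(classify(z_right))   # [2 2 2] -- matches intuition
```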
Conclusion:
In short, the reasons for deriving the parameters on the training set and re-using them to scale the test set are:
- The test set stands in for new, unseen data, so it must not contribute to the scaling parameters.
- New data may be just a single point, for which no mean or standard deviation can be estimated.
- Scaling training and test data with the same parameters keeps them in the same feature space, so the decision rule learned on the training set stays meaningful.
This conclusion is my own summary; if anything is wrong, please point it out. Discussion is welcome!