Splitting a Dataset into Training, Validation, and Test Sets

Testing Our Model

Supervised machine learning algorithms are amazing tools capable of making predictions and classifications. However, it is important to ask yourself how accurate those predictions are. After all, it’s possible that every prediction your classifier makes is actually wrong! Luckily, we can leverage the fact that supervised machine learning algorithms, by definition, have a dataset of pre-labeled data points. In order to test the effectiveness of your algorithm, we’ll split this data into:

training set
validation set
test set

Training Set vs Validation Set

The training set is the data that the algorithm will learn from. Learning looks different depending on which algorithm you are using. For example, when using Linear Regression, the points in the training set are used to draw the line of best fit. In K-Nearest Neighbors, the points in the training set are the points that could be the neighbors.
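
To make the Linear Regression case concrete, here is a minimal sketch of "learning" a line of best fit from a training set using ordinary least squares. The function name `fit_line` and the sample points are made up for illustration; a real project would more likely rely on a library.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training points that lie exactly on y = 2x + 1
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(train_x, train_y)
print(slope, intercept)
```

Only the training points are used here; the validation points play no part in fitting the line.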

After training using the training set, the points in the validation set are used to compute the accuracy or error of the classifier. The key insight here is that we know the true labels of every point in the validation set, but we’re temporarily going to pretend like we don’t. We can use every point in the validation set as input to our classifier. We’ll then receive a classification for that point. We can now peek at the true label of the validation point and see whether we got it right or not. If we do this for every point in the validation set, we can compute the validation error!
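
The procedure above can be sketched in a few lines. This example assumes a very simple classifier (1-nearest neighbor) and a tiny made-up dataset; the helper names `nearest_label` and `validation_accuracy` are illustrative, not from any library.

```python
def nearest_label(point, train_points, train_labels):
    """Classify `point` with the label of its closest training point."""
    closest = min(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, train_points[i])),
    )
    return train_labels[closest]


def validation_accuracy(train_points, train_labels, val_points, val_labels):
    """Fraction of validation points the classifier labels correctly."""
    correct = 0
    for point, true_label in zip(val_points, val_labels):
        # Pretend we do not know the true label and ask the classifier.
        predicted = nearest_label(point, train_points, train_labels)
        # Now peek at the true label and check the answer.
        if predicted == true_label:
            correct += 1
    return correct / len(val_points)


# Hypothetical data: two clusters labeled "a" and "b"
train_points = [(1, 1), (2, 1), (8, 9), (9, 8)]
train_labels = ["a", "a", "b", "b"]
val_points = [(1, 2), (9, 9), (5, 4), (6, 6)]
val_labels = ["a", "b", "b", "b"]

accuracy = validation_accuracy(train_points, train_labels, val_points, val_labels)
error = 1 - accuracy
print(accuracy, error)
```

Validation error is simply one minus validation accuracy, so either number tells the same story.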

Validation error might not be the only metric we’re interested in. Precision, recall, and F1 score often give a fuller picture of how well a machine learning algorithm is doing, particularly when one class is much rarer than the others.
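
For reference, these three metrics can be computed directly from the true and predicted labels of the validation set. The sketch below assumes a binary problem where one label is treated as the "positive" class; the function name and the sample labels are made up for illustration.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall, and F1 for the class `positive`."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and truth == positive:
            tp += 1  # true positive
        elif predicted == positive:
            fp += 1  # false positive
        elif truth == positive:
            fn += 1  # false negative
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical validation labels and classifier outputs
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels, positive=1)
print(precision, recall, f1)
```

Here the classifier finds 2 of the 4 positives (recall 1/2), and 2 of its 3 positive predictions are right (precision 2/3), which gives an F1 score of 4/7.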

How to Split

Figuring out how much of your data should be split into your validation set is a tricky question. If your training set is too small, then your algorithm might not have enough data to effectively learn. On the other hand, if your validation set is too small, then your accuracy, precision, recall, and F1 score could have a large variance. You might happen to get a really lucky or a really unlucky split! In general, putting 80% of your data in the training set, and 20% of your data in the validation set is a good place to start.
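
An 80/20 split might look like the following sketch. It shuffles before splitting so that the result does not depend on how the data happens to be ordered; the function name, the default fraction, and the fixed seed are assumptions made for the example.

```python
import random


def train_validation_split(data, validation_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and split it into (train, validation)."""
    shuffled = list(data)
    # A fixed seed makes the split reproducible; change it to get another split.
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    validation = shuffled[:n_validation]
    train = shuffled[n_validation:]
    return train, validation


# Hypothetical dataset of 100 items
data = list(range(100))
train, validation = train_validation_split(data)
print(len(train), len(validation))
```

Running the split with several different seeds and comparing the resulting metrics is a quick way to see how much variance a single split introduces.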

N-Fold Cross-Validation

Sometimes your dataset is so small that splitting it 80/20 will still result in a large amount of variance. One solution to this is to perform N-Fold Cross-Validation. The central idea here is that we’re going to do this entire process N times and average the results. For example, in 10-fold cross-validation, we’ll make the validation set the first 10% of the data, train on the remaining 90%, and calculate accuracy, precision, recall, and F1 score. We’ll then make the validation set the second 10% of the data and calculate these statistics again. We can do this process 10 times, and every time the validation set will be a different chunk of the data. If we then average all of the accuracies, we will have a better sense of how our model does on average.
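
The fold rotation can be sketched as below. This is an illustration under simple assumptions: folds are contiguous chunks (so the data should already be shuffled), and the "model" is a stand-in baseline that always predicts the most common training label. All function names are hypothetical.

```python
from collections import Counter


def n_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    start = 0
    for fold in range(n_folds):
        # Spread any remainder over the first few folds.
        size = n_samples // n_folds + (1 if fold < n_samples % n_folds else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


def majority_accuracy(train_labels, val_labels):
    """Stand-in model: always predict the most common training label."""
    guess = Counter(train_labels).most_common(1)[0][0]
    return sum(label == guess for label in val_labels) / len(val_labels)


def cross_validate(labels, n_folds):
    """Average the validation accuracy over all folds."""
    scores = []
    for train_idx, val_idx in n_fold_indices(len(labels), n_folds):
        train_labels = [labels[i] for i in train_idx]
        val_labels = [labels[i] for i in val_idx]
        scores.append(majority_accuracy(train_labels, val_labels))
    return sum(scores) / len(scores)


# Hypothetical labels: every fourth item is class 0, the rest are class 1
labels = [0 if i % 4 == 0 else 1 for i in range(20)]
mean_accuracy = cross_validate(labels, n_folds=5)
print(mean_accuracy)
```

Every data point ends up in the validation set exactly once, which is why the averaged score is less sensitive to a lucky or unlucky split.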

Changing The Model / Test Set

Understanding the accuracy of your model is invaluable because you can begin to tune the parameters of your model to increase its performance. For example, in the K-Nearest Neighbors algorithm, you can watch what happens to accuracy as you increase or decrease K. (You can try out all of this in our K-Nearest Neighbors lesson!)
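
As a small illustration of tuning K, the sketch below evaluates a hand-written K-Nearest Neighbors classifier on a validation set for several values of K and keeps the best one. The data is made up (including one deliberately mislabeled training point), and the helper names are hypothetical.

```python
from collections import Counter


def knn_predict(point, train_points, train_labels, k):
    """Predict the majority label among the k closest training points."""
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, train_points[i])),
    )
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy_for_k(k, train_points, train_labels, val_points, val_labels):
    """Validation accuracy of the classifier for a given k."""
    correct = sum(
        knn_predict(point, train_points, train_labels, k) == truth
        for point, truth in zip(val_points, val_labels)
    )
    return correct / len(val_points)


# Hypothetical training data; (2, 2) is a noisy point labeled "b"
train_points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b", "b"]
val_points = [(2.2, 2.2), (8.5, 8.5), (1, 1.5), (9, 9)]
val_labels = ["a", "b", "a", "b"]

candidates = [1, 3, 5, 7]
scores = {
    k: accuracy_for_k(k, train_points, train_labels, val_points, val_labels)
    for k in candidates
}
best_k = max(candidates, key=lambda k: scores[k])
print(scores, best_k)
```

With K = 1 the noisy point misleads the classifier, while K = 7 uses every training point and always predicts the larger class; values in between do best on this toy data.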

Once you’re happy with your model’s performance, it is time to introduce the test set. This is part of your data that you partitioned away at the very start of your experiment. It’s meant to be a substitute for the data in the real world that you’re actually interested in classifying. It functions very similarly to the validation set, except you never touched this data while building or tuning your model. By finding the accuracy, precision, recall, and F1 score on the test set, you get a good understanding of how well your algorithm will do in the real world.
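
One way to make sure the test set stays untouched is to carve it off before anything else happens. The sketch below does a single three-way split; the 60/20/20 proportions, the function name, and the seed are assumptions for the example, since the text does not prescribe a test-set size.

```python
import random


def three_way_split(data, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and return (train, validation, test)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    # Set the test set aside first; it is not used again until the very end.
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


# Hypothetical dataset of 100 items
data = list(range(100))
train, validation, test = three_way_split(data)
print(len(train), len(validation), len(test))
```

Training and tuning then use only `train` and `validation`; the metrics on `test` are computed once, after the model is final.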

Translated from: https://medium.com/@vinaykumarpaspula/splitting-a-data-set-into-training-validation-and-test-sets-f1654b7574c