Machine Learning Notes: Applications and Limitations of K-Fold Cross Validation

1. Applications of Cross Validation

1) Cross validation is used to compare the performance of different machine learning models on a given data set:

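Suppose two machine learning models, K Nearest Neighbours (KNN) and Support Vector Machine (SVM), are applied to the MNIST data set. To compare their classification performance, cross validation can be used, which helps select whichever of the two performs better on MNIST.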
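The comparison described above might look roughly like the following. This is a minimal sketch, assuming scikit-learn is installed. It uses scikit-learn's small bundled digits data set as a stand-in for MNIST so that it runs without a download, and the model settings (`n_neighbors=3`, an RBF kernel) are illustrative choices, not values taken from the reference.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: the bundled 8x8 digits data stands in for MNIST here.
X, y = load_digits(return_X_y=True)

# Illustrative, untuned settings for the two candidate models.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(kernel="rbf", gamma="scale"),
}

mean_accuracy = {}
for name, model in models.items():
    # 5-fold cross validation: each fold is used once as the validation set
    # while the model is trained on the remaining four folds.
    scores = cross_val_score(model, X, y, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")

# Pick the model with the higher mean cross-validated accuracy.
best_model = max(mean_accuracy, key=mean_accuracy.get)
print("Higher mean cross-validated accuracy:", best_model)
```

Both models are scored on the same folds, so the difference in mean accuracy reflects the models and not a lucky split. When the two means are close, the per-fold standard deviation is worth checking before declaring a winner.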

2) Cross validation is used to select suitable model parameters:

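Suppose a KNN classifier is chosen to classify the MNIST data set. Choosing a suitable value for the parameter k is essential when building this classifier, but picking k by intuition is not a good idea. Instead, try several values of k, use cross validation to evaluate the model's performance at each one, compare the results, and take the k that gives the best performance as the classifier's parameter.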
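To make the mechanics concrete, here is a from-scratch sketch of choosing k with K-fold cross validation, using only the Python standard library. The helper names (`k_fold_indices`, `knn_predict`, `cross_val_accuracy`), the synthetic two-class data, and the candidate values of k are all assumptions made for illustration. They are not part of the reference, and the data is not MNIST.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and split them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def knn_predict(train_X, train_y, point, k):
    """Predict a label by majority vote among the k nearest training points."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], point)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cross_val_accuracy(X, y, k, n_folds=5):
    """Mean validation accuracy of a k-NN classifier over n_folds folds."""
    accuracies = []
    for fold in k_fold_indices(len(X), n_folds):
        held_out = set(fold)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / n_folds

# Synthetic two-class data (hypothetical, for illustration only).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)] + \
    [(rng.gauss(2.5, 1), rng.gauss(2.5, 1)) for _ in range(60)]
y = [0] * 60 + [1] * 60

# Evaluate each candidate k on the same folds, then keep the best one.
candidate_ks = [1, 3, 5, 7, 9]
results = {k: cross_val_accuracy(X, y, k) for k in candidate_ks}
best_k = max(results, key=results.get)
for k in candidate_ks:
    print(f"k={k}: mean accuracy {results[k]:.3f}")
print("Selected k:", best_k)
```

The fold assignment uses a fixed seed, so every candidate k is evaluated on identical splits and the comparison is like for like. Only odd values of k are tried, which avoids tied votes in a two-class problem.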

2. Limitations of Cross Validation

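For cross validation to give meaningful results, the training set and the test set must be drawn from the same data set, so the amount of available data is a limiting factor: the training set taken from it should not be so small that the model overfits. In addition, human bias must be controlled when selecting the test set; otherwise cross validation becomes meaningless.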

 

Reference website:

1. https://magoosh.com/data-science/k-fold-cross-validation/

Original text:

1. Applications of Cross Validation

1) The cross validation technique can be used to compare the performance of different machine learning models on the same data set. To understand this point better, consider the following example.

Suppose you want to make a classifier for the MNIST data set, which consists of hand-written numerals from 0 to 9. You are considering using either K Nearest Neighbours (KNN) or Support Vector Machine (SVM). To compare the performance of the two machine learning models on the given data set, you can use cross validation. This will help you determine which predictive model you should choose working with for the MNIST data set.

2) Cross validation can also be used for selecting suitable parameters. The example mentioned below will illustrate this point well.

Suppose you have to build a K Nearest Neighbours (KNN) classifier for the MNIST data set. To use this classifier, you should provide an appropriate value of the parameter k to the classifier. Choosing the value of k intuitively is not a good idea (beware of overfitting!). You can play around with different values of the parameter and use cross validation to estimate the performance of the predictive model corresponding to each k. You should finally go ahead with the value of k that gives the best performance of the predictive model on the given data set.

2. Limitations of Cross Validation

For cross validation to give some meaningful results, the training set and the validation set are required to be drawn from the same population. Also, human biases need to be controlled, or else cross validation will not be fruitful.
