How to Prevent Overfitting in Machine Learning Models (Translation Notes)

Very deep neural networks with a huge number of parameters are powerful machine learning systems. But in such massive networks, overfitting is a common and serious problem. Learning how to deal with overfitting is essential to mastering machine learning. The fundamental issue in machine learning is the tension between optimization and generalization. Optimization refers to the process of adjusting a model to get the best possible performance on the training data (the *learning* in machine learning), whereas generalization refers to how well the trained model performs on data it has **never seen before** (the test set). The goal of the game is good generalization. But you don't control generalization directly; you can only adjust the model based on its training data.

How do you know whether a model is overfitting?


The clearest sign of overfitting is when model accuracy is high on the training set but drops significantly on new data or on the test set. This means the model knows the training data very well but cannot generalize, which makes it useless in production or in A/B tests in most domains.

How can you prevent overfitting?


Okay, so let's say you have found that your model overfits. What can you do to prevent it?
Fortunately, there are many techniques you can try. Below I describe a few of the most widely used solutions for overfitting.

1. Reduce the network size


The simplest way to prevent overfitting is to reduce the size of the model: the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer).
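As a minimal sketch (assuming a Keras-style binary classifier; the layer sizes are illustrative and not from the original article), reducing capacity simply means shrinking the layers:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Higher-capacity model: more units per layer means more learnable
# parameters, and with them a greater tendency to memorize the
# training data.
original_model = keras.Sequential([
    layers.Dense(512, activation="relu"),
    layers.Dense(512, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# Smaller model: fewer units per layer reduces the parameter count,
# constraining what the network can memorize and pushing it toward
# more general representations.
smaller_model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
```

There is no magic formula for the right size; in practice you start small and grow the network while watching validation performance.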

2. Cross-Validation


In cross-validation, the initial training data is split into several smaller train-test subsets, and these splits are then used to tune the model. The most popular form is K-fold cross-validation, where K is the number of folds. There is a short video from Udacity which explains K-fold cross-validation very well.
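For instance, a short scikit-learn sketch of 5-fold cross-validation (the dataset and estimator here are placeholders, not from the article):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# Split the training data into 5 folds; each fold serves once as the
# validation set while the model is trained on the remaining 4 folds.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

print("Per-fold accuracy:", scores)
print("Mean accuracy: %.3f" % scores.mean())
```

A model that overfits will typically show a large gap between its score on the data it was fit on and these held-out fold scores.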

3. Add weight regularization


Given two explanations for something, the explanation most likely to be correct is the simplest one — the one that makes fewer assumptions. This idea also applies to the models learned by neural networks: given some training data and a network architecture, multiple sets of weight values could explain the data. Simpler models are less likely to overfit than complex ones. A simple model in this context is a model where the distribution of parameter values has less entropy (or a model with fewer parameters). Thus a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular. This is called weight regularization, and it’s done by adding to the loss function of the network a cost associated with having large weights.
This cost comes in two flavors:
**L1 regularization:** the cost added is proportional to the absolute value of the weight coefficients.
**L2 regularization:** the cost added is proportional to the square of the value of the weight coefficients. L2 regularization is also called weight decay in the context of neural networks. [1]
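In Keras this is a one-line change per layer; a minimal sketch (the 0.001 penalty factor is illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# l2(0.001) adds 0.001 * weight_coefficient ** 2 to the network's
# total loss for every weight in the layer, penalizing large weights.
# regularizers.l1(...) would give the absolute-value (L1) variant.
model = keras.Sequential([
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(1, activation="sigmoid"),
])
```

Because the penalty is only added at training time, the training loss of a regularized network will be noticeably higher than its loss at test time.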

4. Removing irrelevant features


Improve the data by removing irrelevant features. A dataset may contain many features that contribute little to the prediction, and removing those less important features can improve accuracy and reduce overfitting. You can use scikit-learn's feature selection module for this purpose.
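A small sketch with scikit-learn (the dataset and k=2 are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep only the k features with the strongest statistical relationship
# to the target, scored here with the ANOVA F-test.
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (150, 4) -> (150, 2)
```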

5. Adding dropout


Dropout, applied to a layer, consists of randomly dropping out (setting to zero) a number of output features of the layer during training. Let's say a given layer would normally return a vector [0.2, 0.5, 1.3, 0.8, 1.1] for a given input sample during training. After applying dropout, this vector will have a few zero entries distributed at random: for example, [0, 0.5, 1.3, 0, 1.1].
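In Keras, dropout is applied via a Dropout layer placed after the layer whose outputs it should zero; a minimal sketch (the 0.5 rate is a common choice, not a prescription):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Dropout(0.5) zeroes out 50% of the preceding layer's output
# features at random during training; at test time the layer is a
# no-op (Keras scales the kept activations up during training to
# compensate).
model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
```
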
6. Data Augmentation


The simplest way to reduce overfitting is to increase the size of the training data; in theory, if the training set covered all possible data, there would be no overfitting at all. Suppose we are dealing with images. In this case, there are a few ways of increasing the effective size of the training data: rotating the image, flipping, scaling, shifting, and so on. This technique is known as data augmentation, and it usually provides a big leap in improving the accuracy of the model.
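A small sketch using Keras's ImageDataGenerator (the transformation ranges are illustrative):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each training image is randomly transformed on the fly, so the
# model practically never sees the exact same input twice.
datagen = ImageDataGenerator(
    rotation_range=40,       # random rotations up to 40 degrees
    width_shift_range=0.2,   # random horizontal shifts
    height_shift_range=0.2,  # random vertical shifts
    zoom_range=0.2,          # random zoom in/out
    horizontal_flip=True,    # random left-right flips
)

# Hypothetical usage: stream augmented batches from a directory of
# images (the path is a placeholder).
# train_generator = datagen.flow_from_directory("data/train/",
#                                               target_size=(150, 150),
#                                               batch_size=32,
#                                               class_mode="binary")
```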
