For a deep learning or machine learning model, we not only require it to fit the training dataset well (low training error), but also expect it to fit unseen data (the test set) well, i.e., to have good generalization ability; the resulting error on unseen data is called the generalization error. The most intuitive way to gauge generalization ability is to look at whether the model is overfitting or underfitting. Overfitting and underfitting describe two states a model can be in during training. Generally speaking, the training process follows a curve like the one shown below.
At the beginning of training, the model is still learning and sits in the underfitting region. As training progresses, both the training error and the test error decrease. After a critical point, the training error keeps decreasing while the test error starts to increase, and the model enters the overfitting region: the trained network fits the training set too closely and no longer works well on data outside the training set.
- What is underfitting?
Underfitting is when the model fails to obtain a low enough error on the training set. In other words, the model's complexity is too low: it performs poorly on the training set and cannot learn the underlying patterns in the data.
How to solve underfitting?
Underfitting mostly appears at the beginning of training and usually disappears as training continues. If it persists, increasing the capacity of the network or adding features to the model are good ways to address it.
- What is overfitting?
Overfitting is when the gap between the training error and the test error is too large. In other words, the model is more complex than the actual problem requires: it performs well on the training set but poorly on the test set. The model has memorized the training set (it remembers properties or characteristics of the training set that do not carry over to the test set) instead of learning the underlying patterns, so its generalization ability is poor.
Why does overfitting occur?
The main reasons are as follows:
1. The training samples are too uniform or too few. If the training set contains only negative samples, the resulting model certainly cannot predict positive samples accurately. The training samples should therefore be as comprehensive as possible and cover all types of data.
2. There is too much noise in the training data. Noise means interfering data points in the training set; too much of it causes the model to record many noisy features and ignore the real relationship between input and output.
3. The model is too complex. An over-complex model can memorize the training data, but it cannot adapt when it encounters unseen data, so its generalization ability is poor. We want the model to produce stable outputs for different inputs; an overly complex model is an important cause of overfitting.
- How to prevent overfitting?
To solve the overfitting problem, we need to significantly reduce the test error without excessively increasing the training error, thereby improving the generalization ability of the model. We can use regularization methods. So what is regularization? Regularization means modifying the learning algorithm so that it reduces the generalization error rather than the training error.
Commonly used regularization methods can be divided into: (1) parameter regularization methods that impose explicit regularization constraints, such as L1/L2 regularization; (2) methods that achieve a lower generalization error through engineering tricks, such as early stopping and dropout; (3) implicit regularization methods that do not impose constraints directly, such as data augmentation.
Acquiring and Using More Data (Dataset Augmentation) - A Fundamental Way to Address Overfitting
The best way to make a machine learning or deep learning model generalize better is to train it on more data. In practice, however, the amount of data we have is limited. One way around this is to create "fake data" and add it to the training set, i.e., dataset augmentation: the training set is enlarged with transformed copies of its own examples, which improves the generalization ability of the model.
Taking an image dataset as an example, we can rotate images, zoom them, randomly crop, add random noise, translate, mirror, and so on to increase the amount of data. In addition, in object classification, a CNN should be strongly "invariant" during image recognition: the shape, posture, position, and overall brightness of the object to be recognized should not affect the classification result. We can therefore multiply the size of the dataset by translating, flipping, zooming, and cropping the images, as in the sketch below.
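For instance, assuming torchvision is available, an augmentation pipeline along these lines produces a slightly different version of each image every epoch (the particular transforms and parameter values are illustrative, not prescriptive):

```python
# Illustrative augmentation pipeline using torchvision (assumed available).
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # mirror
    transforms.RandomRotation(degrees=15),                 # small random rotations
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop + zoom
    transforms.ColorJitter(brightness=0.2),                # brightness perturbation
    transforms.ToTensor(),
])

# Passed to a dataset, e.g. datasets.ImageFolder(data_dir, transform=train_transform),
# so every epoch sees freshly transformed copies of the training images.
```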
Adopt a suitable model (control the complexity of the model)
Overly complex models can lead to overfitting. For model design, a widely cited deep learning rule of thumb is "deeper is better". Researchers have found through experiments and competitions that, for CNNs, more layers generally give better results, but deeper networks are also more prone to overfitting and take longer to train.
According to Occam's razor: among the hypotheses that can equally explain the known observations, we should pick the "simplest" one. For model design, we should choose simple, appropriate models to solve complex problems.
Reduce the number of features
With some feature engineering it is possible to reduce the number of features: remove redundant features and manually choose which ones to keep. This too can mitigate overfitting; a sketch follows below.
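As an illustrative sketch, assuming scikit-learn is available, univariate feature selection can keep only the k most informative features (the synthetic data and the choice k=10 are made up for the example):

```python
# Keep only the k features most correlated with the target (illustrative).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))            # 200 samples, 50 features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only the first two features matter here

selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)                    # (200, 10): redundant features dropped
```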
L1/L2 regularization
(1) L1 regularization
Add an L1 regularization term to the original loss function $C_0$: the sum of the absolute values of all weights, multiplied by $\frac{\lambda}{n}$ (where $n$ is the size of the training set and $\lambda$ is the regularization strength). The loss function becomes:

$$C = C_0 + \frac{\lambda}{n}\sum_w |w|$$

Corresponding gradient (derivative):

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\,\mathrm{sgn}(w)$$

where $\mathrm{sgn}(w)$ simply takes the sign of each element of $w$.

The weight update during gradient descent (with learning rate $\eta$) becomes:

$$w \rightarrow w - \frac{\eta\lambda}{n}\,\mathrm{sgn}(w) - \eta\frac{\partial C_0}{\partial w}$$

When $w = 0$, $|w|$ is not differentiable, so we can only update $w$ according to the original, unregularized rule.

When $w > 0$, $\mathrm{sgn}(w) > 0$, so the update makes $w$ smaller.

When $w < 0$, $\mathrm{sgn}(w) < 0$, so the update makes $w$ larger (closer to 0). In other words, L1 regularization pushes the weights $w$ towards 0, making as many weights in the network as possible equal to 0, which is equivalent to reducing the complexity of the network and preventing overfitting.

This is why L1 regularization produces sparser solutions. Here sparsity means that some parameters of the optimal solution are exactly 0. The sparsity induced by L1 regularization is widely used in feature selection, to pick out meaningful features from the available ones.
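As a concrete, non-authoritative sketch, assuming PyTorch is available, the $\frac{\lambda}{n}\sum_w |w|$ term can simply be added to the loss by hand; the model, batch, and hyperparameters below are purely illustrative:

```python
# Sketch of L1 regularization added manually to a PyTorch loss.
import torch
import torch.nn as nn

model = nn.Linear(20, 2)                       # stand-in model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
lambda_l1 = 1e-4                               # illustrative regularization strength
n = 1000                                       # training-set size in the lambda/n scaling above

x = torch.randn(32, 20)                        # dummy batch
y = torch.randint(0, 2, (32,))

loss = criterion(model(x), y)                  # C0
l1_term = sum(p.abs().sum() for p in model.parameters())
loss = loss + (lambda_l1 / n) * l1_term        # C = C0 + (lambda/n) * sum |w|

optimizer.zero_grad()
loss.backward()                                # gradient picks up the (lambda/n) * sgn(w) term
optimizer.step()
```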
(2) L2 regularization
L2 regularization is often referred to as weight decay. It adds an L2 regularization term to the original loss function: the sum of the squares of all weights, multiplied by $\frac{\lambda}{2n}$. The loss function becomes:

$$C = C_0 + \frac{\lambda}{2n}\sum_w w^2$$

Corresponding gradients (derivatives):

$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w, \qquad \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b}$$

It can be seen that the L2 regularization term has no effect on the update of the bias $b$, but it does affect the update of the weights $w$:

$$w \rightarrow w - \eta\frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{n} w = \left(1 - \frac{\eta\lambda}{n}\right) w - \eta\frac{\partial C_0}{\partial w}$$

Since $\eta$, $\lambda$, and $n$ are all greater than 0, the factor $1 - \frac{\eta\lambda}{n}$ is less than 1. So during gradient descent the weights $w$ gradually shrink, tending towards 0 without reaching it. This is the origin of the name "weight decay".

L2 regularization therefore has the effect of making the weight parameters $w$ smaller. Why does this help prevent overfitting? Because smaller weights mean a lower model complexity: the model fits the training data just well enough without overfitting it, which improves its generalization ability.
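A minimal numerical sketch of the update rule above (the weights, gradient values, and hyperparameters are made up for illustration):

```python
# Numerical sketch of the L2 ("weight decay") update; grad_C0 stands in for
# the gradient of the unregularized loss C0.
import numpy as np

eta, lam, n = 0.1, 0.5, 100        # learning rate, regularization strength, sample count
w = np.array([1.0, -2.0, 3.0])
grad_C0 = np.array([0.1, -0.2, 0.3])

w_new = (1 - eta * lam / n) * w - eta * grad_C0
print(w_new)   # each weight is first shrunk by the factor (1 - eta*lam/n), then updated as usual
```

In frameworks such as PyTorch, an equivalent effect is usually obtained by passing a `weight_decay` argument to the optimizer rather than editing the loss function by hand.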
Dropout
Dropout is a trick used when training the network; it is equivalent to adding noise to the hidden units. Dropout means that during training, a fraction of the hidden units (neurons) is randomly "removed" with some probability (for example 50%) at each iteration. The "removal" is not literal: the outputs of those neurons are set to 0, so they take no part in that forward and backward pass.
Why does dropout help prevent overfitting?
(a) Each training iteration effectively trains a different sub-model, and different sub-models produce different outputs. As training continues, the outputs fluctuate within a range but their mean does not change much, so the final result can be regarded as the averaged output of many different models.
(b) It eliminates or weakens the coupling between neurons, reducing the network's dependence on any single neuron and thereby enhancing the generalization ability.
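As a rough illustration of the masking mechanism described above, here is a minimal sketch assuming PyTorch is available (the layer sizes and dropout probability are illustrative):

```python
# Dropout randomly zeroes hidden activations during training;
# at evaluation time it is disabled.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden unit is dropped with probability 0.5
    nn.Linear(64, 10),
)

x = torch.randn(8, 100)
net.train()              # dropout active: a fresh random mask per forward pass
out_train = net(x)
net.eval()               # dropout inactive: all units are used
out_eval = net(x)
```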
Early stopping
The process of training a model is the process of learning and updating its parameters, and this learning usually relies on iterative methods such as gradient descent. Early stopping truncates the number of iterations to prevent overfitting: training is stopped before the model has fully converged on the training dataset.
To obtain a well-performing neural network, training may run for many epochs (one epoch is one full pass over the training data). If the number of epochs is too small, the network may underfit; if it is too large, overfitting may occur. Early stopping removes the need to set the number of epochs by hand. The specific approach: after each epoch (or every N epochs), evaluate the model on the validation set. As the epochs increase, if the validation error starts to rise, stop training and use the weights at that point as the final parameters of the network. A sketch of this procedure is given below.
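In the sketch, `train_one_epoch`, `evaluate`, `model`, the data loaders, and `max_epochs` are assumed to exist elsewhere (illustrative names), and `patience` (how many non-improving epochs to tolerate) is an illustrative addition:

```python
# Sketch of early stopping; helper functions and objects are assumed, not defined here.
import copy

best_val_error = float("inf")
best_weights = None
patience, bad_epochs = 5, 0   # tolerate up to 5 epochs without improvement

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)
    val_error = evaluate(model, val_loader)
    if val_error < best_val_error:
        best_val_error = val_error
        best_weights = copy.deepcopy(model.state_dict())  # remember the best weights so far
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break   # validation error keeps rising: stop before overfitting worsens

model.load_state_dict(best_weights)   # use the weights at the stopping point as the final parameters
```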
Why does this prevent overfitting? When the neural network has not yet run many iterations, the parameters w are close to 0, because they are initialized to small random values. As the iterations proceed, the magnitude of w grows larger and larger. What early stopping does is halt the iterative process at some intermediate point, leaving a medium-sized w. This gives a result similar to L2 regularization: a network with small weight parameters.
Disadvantage of early stopping: instead of handling the two problems of optimizing the loss function and preventing overfitting with separate tools, it uses a single mechanism for both, which makes the trade-offs harder to reason about. The reason they can no longer be handled independently is that if you stop optimizing the loss function early, its value may not yet be small enough, while at the same time you are trying not to overfit.