This personal note was written after studying the opening course on Coursera, Machine Learning by Andrew Ng. All images and audio in this note come from that course.
Once we have done some troubleshooting for errors in our predictions by:
Getting more training examples
Trying smaller sets of features
Trying additional features
Trying polynomial features
Increasing or decreasing λ
We can move on to evaluate our new hypothesis.
A hypothesis may have a low error for the training examples but still be inaccurate (because of overfitting). Thus, to evaluate a hypothesis, given a dataset of training examples, we can split up the data into two sets: a training set and a test set. Typically, the training set consists of 70% of your data and the test set is the remaining 30%.
The new procedure using these two sets is then:
1. Learn $\Theta$ and minimize $J_{train}(\Theta)$ using the training set
2. Compute the test set error $J_{test}(\Theta)$
The test set error for linear regression is:
$J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}} \left(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test}\right)^2$
For classification, we use the misclassification error (aka 0/1 misclassification error):
$err(h_\Theta(x), y) = 1$ if $h_\Theta(x) \geq 0.5$ and $y = 0$, or if $h_\Theta(x) < 0.5$ and $y = 1$; otherwise $err(h_\Theta(x), y) = 0$.
This gives us a binary 0 or 1 error result based on a misclassification. The average test error for the test set is:
$\text{Test Error} = \dfrac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$
This gives us the proportion of the test data that was misclassified.
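To make this concrete, here is a minimal NumPy sketch (my own illustration; the course itself uses Octave) of a random 70/30 split and both test-error measures, assuming `h` is an already-trained hypothesis function that maps inputs to predictions:

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.3, seed=0):
    """Shuffle the examples and hold out a test_ratio fraction as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

def test_error_regression(h, X_test, y_test):
    """J_test(Theta) = 1/(2*m_test) * sum of squared prediction errors."""
    m_test = len(y_test)
    return np.sum((h(X_test) - y_test) ** 2) / (2 * m_test)

def test_error_classification(h, X_test, y_test):
    """Average 0/1 misclassification error: hypothesis outputs >= 0.5 count as class 1."""
    predictions = (h(X_test) >= 0.5).astype(int)
    return np.mean(predictions != y_test)
```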
Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis. It could overfit and as a result your predictions on the test set would be poor. The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any other data set.
Given many models with different polynomial degrees, we can use a systematic approach to identify the ‘best’ function. In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.
If we picked the degree d using the test set, the test error would give an optimistic estimate of the generalization error, so we introduce a third set, the cross validation set. One way to break down our dataset into the three sets is:
Training set: 60%
Cross validation set: 20%
Test set: 20%
We can now calculate three separate error values for the three different sets using the following method:
1. Optimize the parameters $\Theta$ using the training set for each polynomial degree.
2. Find the polynomial degree d with the least error using the cross validation set.
3. Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$, where d is the degree with the lowest cross validation error.
This way, the degree of the polynomial d has not been trained using the test set.
Training error:
$J_{train}(\Theta) = \dfrac{1}{2m} \sum_{i=1}^{m} \left(h_\Theta(x^{(i)}) - y^{(i)}\right)^2$
The cross validation error $J_{CV}(\Theta)$ and the test error $J_{test}(\Theta)$ are defined in the same way over their respective sets.
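As an illustration of this selection procedure, here is a small NumPy sketch (an assumption of mine, not course code) that fits one polynomial per candidate degree on the training set and picks the degree with the lowest cross validation error:

```python
import numpy as np

def fit_polynomial(x, y, d):
    """Least-squares fit of a degree-d polynomial; returns the coefficient vector."""
    return np.polyfit(x, y, d)

def squared_error(theta, x, y):
    """J(theta) = 1/(2m) * sum of squared errors for a polynomial hypothesis."""
    return np.sum((np.polyval(theta, x) - y) ** 2) / (2 * len(y))

def select_degree(x_train, y_train, x_cv, y_cv, max_degree=10):
    """Train one model per degree, then pick the degree with the lowest CV error."""
    best_d, best_cv_error, best_theta = None, np.inf, None
    for d in range(1, max_degree + 1):
        theta = fit_polynomial(x_train, y_train, d)
        cv_error = squared_error(theta, x_cv, y_cv)
        if cv_error < best_cv_error:
            best_d, best_cv_error, best_theta = d, cv_error, theta
    return best_d, best_theta

# The generalization error is then estimated on the held-out test set:
# test_error = squared_error(best_theta, x_test, y_test)
```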
In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.
The training error will tend to decrease as we increase the degree d of the polynomial.
At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.
High bias (underfitting): both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ will be high. Also, $J_{CV}(\Theta) \approx J_{train}(\Theta)$.
High variance (overfitting): $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be much greater than $J_{train}(\Theta)$.
This is summarized in the figure below:
Note: the regularization term below and throughout the video should be $\dfrac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$ and NOT $\dfrac{\lambda}{2m}\sum_{j=1}^{m}\theta_j^2$.
In the figure above, we see that as λ increases, our fit becomes more rigid. On the other hand, as λ approaches 0, we tend to overfit the data. So how do we choose our parameter λ to get it 'just right'? In order to choose the model and the regularization term λ, we need to:
1. Create a list of lambdas (e.g. λ ∈ {0, 0.01, 0.02, 0.04, 0.08, ..., 10.24});
2. Create a set of models with different degrees or any other variants;
3. Iterate through the λs and for each λ go through all the models to learn some $\Theta$;
4. Compute the cross validation error using the learned $\Theta$ (computed with λ) on $J_{CV}(\Theta)$ without regularization (λ = 0);
5. Select the combination that produces the lowest error on the cross validation set;
6. Using that best $\Theta$ and λ, apply it on $J_{test}(\Theta)$ to see if it generalizes well.
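The λ sweep can be sketched in the same spirit; below is a hypothetical NumPy version using the closed-form regularized normal equation, assuming X already contains a leading column of ones for the intercept. It is an illustration of the procedure, not the course's Octave implementation:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized linear regression via the normal equation."""
    n = X.shape[1]
    reg = lam * np.eye(n)
    reg[0, 0] = 0  # do not regularize the intercept term
    return np.linalg.solve(X.T @ X + reg, X.T @ y)

def cost(theta, X, y):
    """Unregularized squared-error cost, used for CV and test evaluation."""
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

def select_lambda(X_train, y_train, X_cv, y_cv):
    """Train with each candidate lambda, then pick the one with the lowest CV error."""
    lambdas = [0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
    best_lam, best_theta, best_err = None, None, np.inf
    for lam in lambdas:
        theta = ridge_fit(X_train, y_train, lam)
        err = cost(theta, X_cv, y_cv)  # evaluated without regularization
        if err < best_err:
            best_lam, best_theta, best_err = lam, theta, err
    return best_lam, best_theta
```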
Training an algorithm on very few data points (such as one, two, or three) will easily give zero error, because we can always find a quadratic curve that touches exactly that many points. Hence:
As the training set gets larger, the training error for a quadratic function increases.
The error value will plateau out after a certain m, or training set size.
Experiencing high bias:
Low training set size: causes $J_{train}(\Theta)$ to be low and $J_{CV}(\Theta)$ to be high.
Large training set size: causes both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ to be high, with $J_{train}(\Theta) \approx J_{CV}(\Theta)$.
If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.
Experiencing high variance:
Low training set size: $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be high.
Large training set size: $J_{train}(\Theta)$ increases with training set size and $J_{CV}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta) < J_{CV}(\Theta)$, but the difference between them remains significant.
If a learning algorithm is suffering from high variance, getting more training data is likely to help.
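A learning curve like the ones described above can be generated by repeatedly training on larger and larger prefixes of the training set. The sketch below is illustrative and reuses hypothetical `fit` and `cost` functions like those defined earlier:

```python
def learning_curve(X_train, y_train, X_cv, y_cv, fit, cost):
    """For each training set size m, train on the first m examples and record
    both the training error and the cross validation error."""
    train_errors, cv_errors = [], []
    sizes = range(1, len(y_train) + 1)
    for m in sizes:
        theta = fit(X_train[:m], y_train[:m])
        train_errors.append(cost(theta, X_train[:m], y_train[:m]))
        cv_errors.append(cost(theta, X_cv, y_cv))  # always the full CV set
    return list(sizes), train_errors, cv_errors

# Reading the two curves:
# - both errors high and close together at large m -> high bias (more data won't help much)
# - a large, persistent gap between them           -> high variance (more data is likely to help)
```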
Our decision process can be broken down as follows:
Getting more training examples: Fixes high variance
Trying smaller sets of features: Fixes high variance
Adding features: Fixes high bias
Adding polynomial features: Fixes high bias
Decreasing λ: Fixes high bias
Increasing λ: Fixes high variance
Using a single hidden layer is a good starting default. You can train neural networks with different numbers of hidden layers, evaluate each one on your cross validation set, and then select the architecture that performs best.
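For example, this architecture search could be run with scikit-learn (a library choice of mine, not the course's Octave code); the candidate layouts below are arbitrary assumptions:

```python
from sklearn.neural_network import MLPClassifier

def select_architecture(X_train, y_train, X_cv, y_cv,
                        candidate_layouts=((25,), (25, 25), (25, 25, 25))):
    """Train one network per candidate hidden-layer layout and keep the one
    with the highest cross validation accuracy."""
    best_layout, best_score, best_model = None, -1.0, None
    for layout in candidate_layouts:
        model = MLPClassifier(hidden_layer_sizes=layout, max_iter=1000, random_state=0)
        model.fit(X_train, y_train)
        score = model.score(X_cv, y_cv)  # accuracy on the cross validation set
        if score > best_score:
            best_layout, best_score, best_model = layout, score, model
    return best_layout, best_model
```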
Model Complexity Effects:
Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly and consistently.
Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.