[Machine Learning] 3 Regularization

Chapter 3: Regularization

  • 1 Addressing Overfitting
  • 2 Regularization
    • 2.1 Theory
    • 2.2 Model
    • 2.3 Problems when $\lambda$ is set too large
  • 3 Regularized Linear Regression
    • 3.1 Gradient Descent
    • 3.2 Normal Equation
  • 4 Regularized Logistic Regression
    • 4.1 Gradient Descent
    • 4.2 Advanced optimization
  • 5 Reference

1 Addressing Overfitting

  • Reduce the number of features
    • Manually select which features to keep
    • Model selection algorithm (automatic selection)
    • Drawback: discarding features may throw away useful information
  • Regularization
    • Keep all the features, but reduce the magnitude/values of the parameters $\theta_j$
    • Works well when we have a lot of features, each of which contributes a bit to predicting $y$

2 Regularization

2.1 Theory

Small values for the parameters $\theta_0, \theta_1, \cdots, \theta_n$:

  • “Simpler” hypothesis
  • Less prone to overfitting

2.2 Model

  • $\mathop{\text{minimize}}\limits_{\theta} J(\theta)$
    $$J(\theta)=\frac{1}{2m}\left[\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2+\lambda\sum_{j=1}^n\theta_j^2\right]$$

$\lambda$: the regularization parameter.
$\lambda\sum_{j=1}^n\theta_j^2$: shrinks every single parameter $\theta_j$.
[Note] The regularization sum starts at $\theta_1$ (in practice, also including $\theta_0$ makes little difference). A minimal code sketch of this cost follows.
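As a minimal sketch (my addition, not from the original notes) of how this cost can be computed with NumPy, assuming X already carries a leading column of ones and theta is a 1-D parameter vector; the names linearCostReg and lam are assumptions:

import numpy as np

def linearCostReg(theta, X, y, lam):
    # regularized linear-regression cost J(theta); lam is the regularization parameter lambda
    m = len(y)
    error = X @ theta - y                     # h_theta(x^(i)) - y^(i) for all m examples
    penalty = lam * np.sum(theta[1:] ** 2)    # the sum starts at theta_1, as noted above
    return (np.sum(error ** 2) + penalty) / (2 * m)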

2.3 Problems when $\lambda$ is set too large

  • Algorithm works fine; setting $\lambda$ to be very large cannot hurt it
  • Algorithm fails to eliminate overfitting
  • Algorithm results in underfitting (fails to fit even the training data well)
  • Gradient descent will fail to converge

Of these candidate outcomes, the actual consequence is underfitting: a very large $\lambda$ penalizes $\theta_1,\cdots,\theta_n$ so heavily that they are driven toward zero, leaving $h_\theta(x)\approx\theta_0$ (see the worked example below).
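As an illustration (my addition, not from the original notes): with a single feature and no intercept term, the closed-form minimizer of the regularized cost shows the shrinkage explicitly.

$$
J(\theta_1)=\frac{1}{2m}\left[\sum_{i=1}^m\left(\theta_1 x^{(i)}-y^{(i)}\right)^2+\lambda\theta_1^2\right]
\quad\Longrightarrow\quad
\frac{dJ}{d\theta_1}=0
\;\Rightarrow\;
\theta_1=\frac{\sum_{i=1}^m x^{(i)}y^{(i)}}{\sum_{i=1}^m\left(x^{(i)}\right)^2+\lambda}
\;\longrightarrow\;0\ \text{ as }\lambda\to\infty.
$$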

3 Regularized Linear Regression

3.1 Gradient Descent

  • Repeat until convergence {
    $$\begin{aligned} \theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}\\ \theta_j&:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right]\\ \text{Equivalently: }\theta_j&:=\theta_j\left(1-\alpha\frac{\lambda}{m}\right)-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)} \end{aligned}$$
    $j=1,2,3,\cdots,n$
    }
    (A NumPy sketch of this loop follows below.)
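A minimal NumPy sketch of this update loop (my addition, not from the original notes), assuming X carries a leading column of ones; the names gradientDescentReg, alpha, lam, and num_iters are assumptions:

import numpy as np

def gradientDescentReg(X, y, theta, alpha, lam, num_iters):
    # batch gradient descent for regularized linear regression
    m = len(y)
    for _ in range(num_iters):
        error = X @ theta - y                   # h_theta(x) - y
        grad = (X.T @ error) / m                # unregularized gradient for every theta_j
        grad[1:] += (lam / m) * theta[1:]       # add (lambda/m) * theta_j for j >= 1 only
        theta = theta - alpha * grad            # simultaneous update of all parameters
    return theta

Leaving index 0 out of the penalty mirrors the separate update rule for $\theta_0$ above.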

3.2 Normal Equation

$$\theta=\left(X^TX+\lambda\begin{bmatrix}0&&&&\\&1&&&\\&&1&&\\&&&\ddots&\\&&&&1\end{bmatrix}_{(n+1)\times(n+1)}\right)^{-1}X^Ty$$

  • As long as $\lambda>0$, the regularized matrix above is guaranteed to be invertible, so non-invertibility is no longer a concern (a NumPy sketch of this solution follows).
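A minimal NumPy sketch of this closed-form solution (my addition, not from the original notes); normalEqnReg and lam are assumed names, and X is assumed to include a leading column of ones:

import numpy as np

def normalEqnReg(X, y, lam):
    # regularized normal equation for linear regression
    L = np.eye(X.shape[1])
    L[0, 0] = 0                                 # do not regularize theta_0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

np.linalg.solve is used here rather than an explicit inverse, which is the usual numerically safer choice.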

4 Regularized Logistic Regression

4.1 Gradient Descent

  • $$J(\theta)=-\frac{1}{m}\sum_{i=1}^m\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2$$
  • Repeat until convergence {
    $$\begin{aligned} \theta_0&:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_0^{(i)}\\ \theta_j&:=\theta_j-\alpha\left[\frac{1}{m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j\right] \end{aligned}$$
    $j=1,2,3,\cdots,n$ (simultaneously update all $\theta_j$)
    }
import numpy as np

def sigmoid(z):
    # logistic function used by the hypothesis h_theta(x) = sigmoid(theta^T x)
    return 1 / (1 + np.exp(-z))

def costReg(theta, X, y, learningRate):
    # regularized logistic-regression cost; learningRate here is the regularization parameter lambda
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    first = np.multiply(-y, np.log(sigmoid(X * theta.T)))
    second = np.multiply((1 - y), np.log(1 - sigmoid(X * theta.T)))
    # the penalty term skips theta_0 (the first column of theta)
    reg = (learningRate / (2 * len(X))) * np.sum(np.power(theta[:, 1:theta.shape[1]], 2))
    return np.sum(first - second) / len(X) + reg
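For the gradient that pairs with costReg (my addition, not from the original notes), following the same matrix conventions; the name gradientReg is an assumption:

def gradientReg(theta, X, y, learningRate):
    # gradient of the regularized logistic cost; learningRate is the regularization parameter lambda
    theta = np.matrix(theta)
    X = np.matrix(X)
    y = np.matrix(y)
    error = sigmoid(X * theta.T) - y
    grad = (X.T * error).T / len(X)                                      # (1/m) * sum_i (h - y) * x_j
    grad[:, 1:] = grad[:, 1:] + (learningRate / len(X)) * theta[:, 1:]   # regularize j >= 1 only
    return np.array(grad).ravel()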

4.2 Advanced optimization

Octave code
function [jVal, gradient] = costFunction(theta)
	jVal = [...code to compute J(theta)...];
	gradient = [... code to compute derivative of J(theta)...];
end
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);

[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
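An analogous call in Python (my addition, not from the original notes), assuming the costReg and gradientReg functions sketched above are in scope; the toy data and the regularization value 1.0 are placeholders:

import numpy as np
from scipy.optimize import minimize

# toy data just so the call runs: 100 examples, intercept column plus 2 features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
y = (rng.random((100, 1)) > 0.5).astype(float)

initialTheta = np.zeros(3)
result = minimize(fun=costReg, x0=initialTheta, args=(X, y, 1.0),
                  method='TNC', jac=gradientReg)
optTheta = result.x    # optimized parameters, analogous to optTheta from fminunc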

5 Reference

  • Andrew Ng (吴恩达), Machine Learning, Coursera
  • Huang Haiguang (黄海广), Machine Learning Notes (机器学习笔记)
