Machine Learning - Gradient Descent in Practice

This article covers some practical techniques for implementing gradient descent, including feature scaling, mean normalization, and choosing the learning rate.


Gradient descent in practice


1. Feature Scaling


Feature Scaling

Idea: Make sure features are on a similar scale, so that gradient descent converges much faster.

  E.g.  x1 = size (0-2000 feet²)
        x2 = number of bedrooms (1-5)



Get every feature into approximately a -1 ≤ xi ≤ 1 range.

Ranges within a small constant factor of this are fine (e.g. 0 ≤ x ≤ 3 or -2 ≤ x ≤ 0.5); ranges that are far larger (e.g. -100 ≤ x ≤ 100) or far smaller (e.g. -0.0001 ≤ x ≤ 0.0001) are not, and such features should be rescaled.
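
For instance, a quick way to do this in MATLAB/Octave (houseSize and numBedrooms are just made-up variable names holding one example's raw values):

% Scale each feature by (an upper bound on) its range:
x1 = houseSize / 2000;      % 0-2000 ft^2  ->  roughly 0 <= x1 <= 1
x2 = numBedrooms / 5;       % 1-5 bedrooms ->  roughly 0 <= x2 <= 1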

Mean normalization 

Replace xi with xi - μi to make features have approximately zero mean (do not apply this to x0 = 1), where μi is the mean value of feature i over the training set. In practice this is usually combined with scaling: xi := (xi - μi) / si, where si is the range (max - min) or the standard deviation of feature i.

E.g., for the house sizes above: if the average size in the training set is about 1000 ft², replace x1 with (size - 1000) / 2000, which puts x1 roughly in the range -0.5 ≤ x1 ≤ 0.5.
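As a sketch of how scaling and mean normalization can be combined in MATLAB/Octave (the function name featureNormalize and the use of the standard deviation as the scaling factor are choices made here, not fixed by the article):

function [X_norm, mu, sigma] = featureNormalize(X)
% FEATURENORMALIZE Mean-normalize and scale every feature (column) of X.
%   Returns the normalized matrix together with the per-feature mean mu
%   and scaling factor sigma, so the same transform can be applied to
%   new examples later.

mu    = mean(X);                 % 1 x n vector of feature means
sigma = std(X);                  % 1 x n vector of feature standard deviations
sigma(sigma == 0) = 1;           % avoid dividing by zero for constant features

X_norm = (X - repmat(mu, size(X, 1), 1)) ./ repmat(sigma, size(X, 1), 1);
end

Note that the intercept column x0 = 1 should be appended after normalization, since mean normalization must not be applied to it.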

2. Learning rate


Gradient descent 

  • “Debugging”: How to make sure gradient descent is working correctly.
  • How to choose the learning rate α.

Making sure gradient descent is working correctly.

  • Plot J(θ) as a function of the number of iterations; J(θ) should decrease after every iteration.


Example automatic convergence test:

Declare convergence if J(θ) decreases by less than 10⁻³ in one iteration.
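
A minimal sketch of this test, assuming a cost history vector J_history like the one returned by gradientDescentMulti in section 3 (the 10⁻³ threshold is only the rule of thumb above):

tol = 1e-3;                                   % convergence threshold
for iter = 2:length(J_history)
    decrease = J_history(iter-1) - J_history(iter);
    if decrease < 0
        warning('J(theta) increased at iteration %d: alpha may be too large.', iter);
        break;
    elseif decrease < tol
        fprintf('Declaring convergence after %d iterations.\n', iter);
        break;
    end
end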

  • If gradient descent is not working (e.g. J(θ) is increasing or oscillating), use a smaller α.

  • For sufficiently small α, J(θ) should decrease on every iteration.
  • But if α is too small, gradient descent can be slow to converge.

Summary: 

  • If α is too small: slow convergence.
  • If α is too large: J(θ) may not decrease on every iteration; may not converge.

To choose α, try values such as ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ... (roughly tripling each time).
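
One way to put this into practice, assuming the design matrix X (with the intercept column, features already normalized) and the labels y are in the workspace, and using the gradientDescentMulti function from section 3 below; the candidate list and the 50 iterations are only illustrative:

alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];   % candidate learning rates
num_iters = 50;                               % short run, just to compare the curves
figure; hold on;
for k = 1:length(alphas)
    theta0 = zeros(size(X, 2), 1);            % restart from the same initial theta
    [~, J_history] = gradientDescentMulti(X, y, theta0, alphas(k), num_iters);
    plot(1:num_iters, J_history, 'DisplayName', sprintf('alpha = %g', alphas(k)));
end
xlabel('Number of iterations');
ylabel('J(\theta)');
legend('show');
hold off;

The α whose curve drops fastest while still decreasing on every iteration is a good choice.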


3. MATLAB Code for Gradient Descent with Multiple Variables

function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
% GRADIENTDESCENTMULTI Run gradient descent to learn theta.
%   X         - training data (design matrix, first column all ones)
%   y         - labels of the training data (column vector)
%   theta     - initial parameter vector
%   alpha     - learning rate
%   num_iters - number of iterations

% Initialize some values
m = length(y);                     % number of training examples
J_history = zeros(num_iters, 1);   % cost J(theta) after each iteration

for iter = 1:num_iters

    % Perform a single gradient step on the parameter vector theta.
    % All parameters are updated simultaneously: theta_tmp is computed
    % from the old theta before theta itself is overwritten.
    theta_tmp = zeros(size(theta));
    for i = 1:size(X, 2)
        theta_tmp(i) = theta(i) - alpha * sum((X*theta - y) .* X(:, i)) / m;
    end
    theta = theta_tmp;

    % Save the cost J in every iteration
    J_history(iter) = computeCostMulti(X, y, theta);   % compute J(theta)

end
end


function J = computeCostMulti(X, y, theta)
% COMPUTECOSTMULTI Compute cost for linear regression with multiple variables.
%   J(theta) = (1 / (2*m)) * (X*theta - y)' * (X*theta - y)

% Initialize some useful values
m = length(y);   % number of training examples

% Compute the cost for this choice of theta
J = 0.5 * (X*theta - y)' * (X*theta - y) / m;

end
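
Putting the pieces together, a small usage sketch (the four training examples are made-up toy values, and featureNormalize is the helper sketched in section 1):

% Toy data, purely for illustration: [size in ft^2, #bedrooms, price]
data = [2104 3 399900;
        1600 3 329900;
        2400 3 369000;
        1416 2 232000];
X = data(:, 1:2);
y = data(:, 3);
m = length(y);

[X_norm, mu, sigma] = featureNormalize(X);    % scale + mean-normalize the features
X_norm = [ones(m, 1), X_norm];                % add the intercept term x0 = 1 afterwards

alpha = 0.1;
num_iters = 400;
theta = zeros(size(X_norm, 2), 1);
[theta, J_history] = gradientDescentMulti(X_norm, y, theta, alpha, num_iters);

fprintf('Final cost J(theta): %.4f\n', J_history(end));
disp('Learned parameters:');
disp(theta);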

