This article covers some practical techniques for implementing gradient descent, including feature scaling, mean normalization, and choosing the learning rate.
Idea: make sure features are on a similar scale, so that gradient descent converges much faster.
E.g. x1 = size (0-2000 feet²)
x2 = number of bedrooms (1-5)
Get every feature into approximately a -1≤ xi ≤1 range.
These ranges are OK: 0 ≤ x1 ≤ 3, -2 ≤ x2 ≤ 0.5.
These ranges are not good: -100 ≤ x3 ≤ 100 (too large), -0.0001 ≤ x4 ≤ 0.0001 (too small).
Replace xi with xi - μi to make features have approximately zero mean (do not apply to x0 = 1), where μi is the mean value of feature i. Combined with feature scaling, this gives xi := (xi - μi)/si, where si is the range (max - min) or the standard deviation of feature i.
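The two steps above (subtract the mean, divide by the range) can be sketched in Python with NumPy; the article's own code is in Octave, and the feature values below are illustrative, not from the article:

```python
import numpy as np

# Toy design matrix: column 0 = size in square feet, column 1 = number of bedrooms.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)                 # per-feature mean (mean normalization)
s = X.max(axis=0) - X.min(axis=0)   # per-feature range; std is also common

X_norm = (X - mu) / s               # zero-mean features, roughly in [-1, 1]
print(X_norm)
```

After this transform every column has mean exactly zero and all entries lie within [-1, 1], since no value deviates from the mean by more than the range.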
Example automatic convergence test:
Declare convergence if J(θ) decreases by less than 10⁻³ in one iteration.
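A minimal Python sketch of this convergence test on a toy one-parameter problem J(θ) = θ² (the function, step rule, and names here are illustrative, not from the article):

```python
def gradient_descent_until_converged(theta, alpha=0.1, tol=1e-3, max_iters=10000):
    """Run gradient descent on J(theta) = theta**2 until the cost
    decreases by less than `tol` in a single iteration."""
    J = theta ** 2
    for _ in range(max_iters):
        theta -= alpha * 2 * theta   # gradient of theta^2 is 2*theta
        J_new = theta ** 2
        if J - J_new < tol:          # automatic convergence test
            return theta, J_new
        J = J_new
    return theta, J

theta, J = gradient_descent_until_converged(10.0)
```

In practice it is still worth plotting J_history against the iteration number, since a suitable tolerance is hard to guess in advance.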
Summary: to choose α, try values spaced roughly 3× apart: ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
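One way to use that list of candidates: run a fixed number of iterations of gradient descent with each α and compare the final costs. A Python sketch on toy linear-regression data (the data, seed, and function names are illustrative, not from the article):

```python
import numpy as np

def run_gd(X, y, alpha, iters=100):
    """Gradient descent for linear regression; returns the final cost J(theta)."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * X.T @ (X @ theta - y) / m   # vectorized gradient step
    residual = X @ theta - y
    return residual @ residual / (2 * m)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])  # intercept + one feature
y = 4 + 3 * X[:, 1]

for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3]:
    print(alpha, run_gd(X, y, alpha))
```

With too small an α the cost barely moves in 100 iterations; larger (but still stable) values drive it close to zero, which is exactly the pattern the candidate list is meant to reveal.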
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
% X is the training data (design matrix)
% y is the vector of training labels
% theta is the vector of parameters
% alpha is the learning rate
% num_iters is the number of iterations

% Initialize some values
m = length(y);                    % number of training examples
J_history = zeros(num_iters, 1);
theta_tmp = zeros(size(theta));   % preallocate as a column vector

for iter = 1:num_iters
    % Perform a single gradient step on the parameter vector theta
    for i = 1:size(X, 2)
        theta_tmp(i) = theta(i) - alpha * sum((X*theta - y) .* X(:, i)) / m;
    end
    theta = theta_tmp;            % update all parameters simultaneously

    % Save the cost J in every iteration
    J_history(iter) = computeCostMulti(X, y, theta);
end
end
function J = computeCostMulti(X, y, theta)
% COMPUTECOSTMULTI Compute cost for linear regression with multiple variables

% Initialize some useful values
m = length(y);                    % number of training examples

% Compute the cost for this choice of theta (vectorized)
J = (X*theta - y)' * (X*theta - y) / (2*m);
end