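Editor's note: This series systematically summarizes the theoretical key points of the Notes from Ng's machine learning course (http://cs229.stanford.edu/materials.html) and provides the code and an analysis of the experimental results for all of the course exercises. "You learn to swim by swimming": the hope is that this series leads to a deep understanding of machine learning algorithms and to the ability to write working, efficient machine learning code yourself and run experiments on real datasets, combining theory and practice.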
Part 1 Linear Regression
1. Supervised Learning
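In the supervised learning setting we are given training data (x^(i), y^(i)), i = 1,...,m, where i indexes the training examples. The task of supervised learning is to find a function h: X -> Y (also called a model or hypothesis) such that h(x) is a good prediction of the corresponding value y. The overall process is: a training set is fed to a learning algorithm, which outputs h; h then maps a new input x to a predicted y.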
2 Linear Regression
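In general we describe each training input with a feature vector and write x_j^(i) for its entries, where j indexes features and i indexes training examples. In supervised learning we need to find the best prediction function h(x); for example, we can choose a linear combination of the features:

    h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n = \sum_{j=0}^{n} \theta_j x_j = \theta^T x

In machine learning, vectors are column vectors by default, so \theta^T here is the transpose of the parameter vector \theta. We also add a "feature 0", x_0 = 1, so that h can be written as a vector product. To find the optimal parameters \theta, we minimize an error function, i.e. the cost function:

    J(\theta) = (1/2) \sum_{i=1}^{m} (h_\theta(x^(i)) - y^(i))^2

This is the least-squares cost function; the optimal parameters are found by minimizing it. (The exercise code later in this post divides by 2m instead of 2; a constant factor does not change the minimizer.)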
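As a concrete illustration of the least-squares cost, here is a minimal Octave/MATLAB sketch that evaluates h and J on a tiny dataset. The data and the starting value of theta are made up for illustration, and the 1/2 convention is used rather than the 1/(2m) scaling of the exercise code.

```matlab
% Toy data, made up for illustration: 3 examples, columns are x_0 = 1 and x_1
X = [1 1; 1 2; 1 3];
y = [1; 2; 3];

theta = [0; 0];                 % hypothetical starting parameters
h = X * theta;                  % h_theta(x^(i)) = theta' * x^(i) for every i
J = 0.5 * sum((h - y) .^ 2);    % least-squares cost, 1/2 convention

% The errors are -1, -2, -3, so J = 0.5 * (1 + 4 + 9) = 7
fprintf('J(theta) = %g\n', J);
```

Setting theta = [0; 1] reproduces y exactly on this data and gives J = 0.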
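3 The LMS Algorithm
To find the optimal parameters, we can initialize \theta randomly and then change it gradually in the direction of the negative gradient (every component of \theta has to be updated), watching how the value of the cost function changes. This is the idea behind gradient descent. Suppose we have only one training example (x, y); the partial derivative of J with respect to \theta_j is

    \partial J(\theta) / \partial \theta_j = (h_\theta(x) - y) x_j

Stepping against this derivative gives the following parameter update rule for a training example (x^(i), y^(i)):

    \theta_j := \theta_j + \alpha (y^(i) - h_\theta(x^(i))) x_j^(i)

Here \alpha is the learning rate, which controls how much the parameters change in each iteration. This is the LMS (least mean squares) algorithm. Intuitively, if a training example satisfies y^(i) - h_\theta(x^(i)) = 0, the parameters do not need to be updated; conversely, the larger the prediction error, the larger the change to the parameters.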
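A single LMS update can be sketched in a few lines of Octave/MATLAB. The example (x, y), the starting theta, and the value of alpha below are all assumptions chosen to keep the arithmetic easy to follow.

```matlab
% Single LMS update on one made-up example (x, y); alpha is an assumed value
theta = [0; 0];
x = [1; 2];          % x_0 = 1, x_1 = 2
y = 3;
alpha = 0.1;

err = y - theta' * x;               % prediction error y - h_theta(x), here 3
theta = theta + alpha * err * x;    % updates every theta_j in one statement

% theta is now approximately [0.3; 0.6]
disp(theta');
```

Because the update is proportional to err, repeating the step with a theta that already predicts y exactly leaves theta unchanged.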
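With multiple training examples, say m examples each described by n features, the GD update rule has to update all n+1 parameters \theta_0, ..., \theta_n (one per feature plus the intercept). There are two ways to do this: batch gradient descent and stochastic/incremental gradient descent. In the former, every round of updates to \theta_j (all parameters must be updated simultaneously for a round to be complete) takes all m training examples into account:

    Repeat until convergence {
        \theta_j := \theta_j + \alpha \sum_{i=1}^{m} (y^(i) - h_\theta(x^(i))) x_j^(i)    (for every j)
    }

That is, each update of \theta_j requires computing the prediction error on all m training examples and summing. In the latter, a round of updates to \theta_j uses only one training example, and the examples are processed one after another (hence "incremental"):

    Loop {
        for i = 1 to m {
            \theta_j := \theta_j + \alpha (y^(i) - h_\theta(x^(i))) x_j^(i)    (for every j)
        }
    }

When the training set size m is very large, stochastic/incremental gradient descent has a clear advantage, because a round of parameter updates does not require scanning all of the training examples.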
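The difference between the two update schemes can be sketched side by side in Octave/MATLAB. The noise-free toy data, the learning rate, and the iteration counts are assumptions for illustration only; on real, noisy data stochastic gradient descent with a fixed alpha typically hovers around the minimum rather than settling exactly on it.

```matlab
% Toy data, made up for illustration: y = 1 + 2*x exactly
X = [ones(5, 1), (1:5)'];
y = 1 + 2 * (1:5)';
m = length(y);
alpha = 0.01;                        % assumed learning rate

% Batch gradient descent: every update sums over all m examples
theta_b = zeros(2, 1);
for iter = 1:5000
    theta_b = theta_b + alpha * X' * (y - X * theta_b);
end

% Stochastic gradient descent: every update uses a single example
theta_s = zeros(2, 1);
for epoch = 1:5000
    for i = 1:m
        xi = X(i, :)';
        theta_s = theta_s + alpha * (y(i) - xi' * theta_s) * xi;
    end
end

fprintf('batch:      %f %f\n', theta_b(1), theta_b(2));
fprintf('stochastic: %f %f\n', theta_s(1), theta_s(2));
```

Both loops approach theta = [1; 2] here; the stochastic version makes m parameter updates per pass over the data, while the batch version makes one.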
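We can also write the cost function as a matrix product. Let X be the m x (n+1) design matrix whose i-th row is (x^(i))^T, and let \vec{y} = (y^(1), ..., y^(m))^T. Then

    J(\theta) = (1/2) (X\theta - \vec{y})^T (X\theta - \vec{y})

We take the gradient of J(\theta) with respect to the vector \theta (differentiating with respect to a vector yields a gradient, which has a direction; this requires matrix calculus and is somewhat more involved than the scalar case, see Ng's course notes for details). Setting the gradient to zero gives the stationary point directly, which is the minimizer when the global optimum is unique, i.e. when X^T X is invertible (the normal equations):

    X^T X \theta = X^T \vec{y},    \theta = (X^T X)^{-1} X^T \vec{y}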
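For readers who want the intermediate steps, the derivation can be sketched as follows. It assumes the 1/2 convention for J and uses the standard matrix-calculus facts that the gradient of \theta^T A \theta is 2A\theta for symmetric A and the gradient of \theta^T b is b.

```latex
\begin{aligned}
J(\theta) &= \tfrac{1}{2}(X\theta - \vec{y})^T (X\theta - \vec{y})
           = \tfrac{1}{2}\left(\theta^T X^T X \theta - 2\,\theta^T X^T \vec{y} + \vec{y}^T \vec{y}\right) \\
\nabla_\theta J(\theta) &= X^T X \theta - X^T \vec{y} \\
\nabla_\theta J(\theta) = 0 \;&\Rightarrow\; X^T X \theta = X^T \vec{y}
  \;\Rightarrow\; \theta = (X^T X)^{-1} X^T \vec{y}
\end{aligned}
```

The last step requires X^T X to be invertible, which holds when the columns of X are linearly independent.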
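3 Programming Practice
(Note: all programming exercises in this part come from Andrew Ng's online open course on machine learning.)
3.1 Linear Regression with One Variable
In linear regression with one variable, each training example is described by a single feature, for example the relationship between the profit of a food truck outlet and the population of its city. Given a number of training examples of population and profit, we fit a linear regression to obtain a straight line, and then use that line to predict the profit for the population of a new city. The main program is as follows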

%% Initialization
clear ; close all; clc

%% ==================== Part 1: Basic Function ====================
% Complete warmUpExercise.m
fprintf('Running warmUpExercise ... \n');
fprintf('5x5 Identity Matrix: \n');
warmUpExercise()

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ======================= Part 2: Plotting =======================
fprintf('Plotting Data ...\n')
data = load('ex1data1.txt');
X = data(:, 1); y = data(:, 2);
m = length(y); % number of training examples

% Plot Data
% Note: You have to complete the code in plotData.m
plotData(X, y);

fprintf('Program paused. Press enter to continue.\n');
pause;

%% =================== Part 3: Gradient descent ===================
fprintf('Running Gradient Descent ...\n')

X = [ones(m, 1), data(:,1)]; % Add a column of ones to x
theta = zeros(2, 1); % initialize fitting parameters

% Some gradient descent settings
iterations = 1500;
alpha = 0.01;

% compute and display initial cost
computeCost(X, y, theta)

% run gradient descent
theta = gradientDescent(X, y, theta, alpha, iterations);

% print theta to screen
fprintf('Theta found by gradient descent: ');
fprintf('%f %f \n', theta(1), theta(2));

% Plot the linear fit
hold on; % keep previous plot visible
plot(X(:,2), X*theta, '-')
legend('Training data', 'Linear regression')
hold off % don't overlay any more plots on this figure

% Predict values for population sizes of 35,000 and 70,000
predict1 = [1, 3.5] * theta;
fprintf('For population = 35,000, we predict a profit of %f\n',...
    predict1*10000);
predict2 = [1, 7] * theta;
fprintf('For population = 70,000, we predict a profit of %f\n',...
    predict2*10000);

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ============= Part 4: Visualizing J(theta_0, theta_1) =============
fprintf('Visualizing J(theta_0, theta_1) ...\n')

% Grid over which we will calculate J
theta0_vals = linspace(-10, 10, 100);
theta1_vals = linspace(-1, 4, 100);

% initialize J_vals to a matrix of 0's
J_vals = zeros(length(theta0_vals), length(theta1_vals));

% Fill out J_vals
for i = 1:length(theta0_vals)
    for j = 1:length(theta1_vals)
        t = [theta0_vals(i); theta1_vals(j)];
        J_vals(i,j) = computeCost(X, y, t);
    end
end

% Because of the way meshgrids work in the surf command, we need to
% transpose J_vals before calling surf, or else the axes will be flipped
J_vals = J_vals';

% Surface plot
figure;
surf(theta0_vals, theta1_vals, J_vals)
xlabel('\theta_0'); ylabel('\theta_1');

% Contour plot
figure;
% Plot J_vals as 20 contours spaced logarithmically between 0.01 and 1000
contour(theta0_vals, theta1_vals, J_vals, logspace(-2, 3, 20))
xlabel('\theta_0'); ylabel('\theta_1');
hold on;
plot(theta(1), theta(2), 'rx', 'MarkerSize', 10, 'LineWidth', 2);

The program first loads the training data and visualizes it.
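Next we need to implement two functions, computeCost and gradientDescent, which compute the cost function and update the parameters along the gradient direction, respectively. Combining the formula for the linear regression cost function with the parameter update rule, we can implement them as follows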

function J = computeCost(X, y, theta)
%COMPUTECOST Compute cost for linear regression
%   J = COMPUTECOST(X, y, theta) computes the cost of using theta as the
%   parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

% You need to return the following variables correctly
J = 0;

% ====================== YOUR CODE HERE ======================
% Instructions: Compute the cost of a particular choice of theta
%               You should set J to the cost.

J = 1/(2 * m) * (X * theta - y)' * (X * theta - y);

% =========================================================================

end

function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
%GRADIENTDESCENT Performs gradient descent to learn theta
%   theta = GRADIENTDESCENT(X, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    % ====================== YOUR CODE HERE ======================
    % Instructions: Perform a single gradient step on the parameter vector
    %               theta.
    %
    % Hint: While debugging, it can be useful to print out the values
    %       of the cost function (computeCost) and gradient here.
    %

    % Batch gradient descent
    Update = 0;
    for i = 1:m
        Update = Update + alpha/m * (y(i) - X(i,:) * theta) * X(i, :)';
    end
    theta = theta + Update;

    % ============================================================

    % Save the cost J in every iteration
    J_history(iter) = computeCost(X, y, theta);

end

end
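The inner loop over the m examples in gradientDescent can also be written as a single matrix product. The sketch below checks on a small made-up dataset that the two forms produce the same step; the data, theta, and alpha are assumptions for illustration, and the alpha/m scaling follows the exercise code above.

```matlab
% Toy data, made up for illustration
X = [1 1; 1 2; 1 3];
y = [2; 4; 7];
theta = [0.5; 0.5];
alpha = 0.01;
m = length(y);

% Loop form, as in gradientDescent above
update_loop = zeros(2, 1);
for i = 1:m
    update_loop = update_loop + alpha/m * (y(i) - X(i, :) * theta) * X(i, :)';
end

% Vectorized form: X' * (y - X*theta) sums the per-example terms at once
update_vec = alpha/m * X' * (y - X * theta);

fprintf('max difference: %g\n', max(abs(update_loop - update_vec)));
```

The vectorized form avoids the explicit loop over examples and works unchanged when X has more than two columns.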
Running the main program prints:

Running Gradient Descent ...
ans = 32.0727
Theta found by gradient descent: -3.630291 1.166362
For population = 35,000, we predict a profit of 4519.767868
For population = 70,000, we predict a profit of 45342.450129
Program paused. Press enter to continue.
Visualizing J(theta_0, theta_1) ...
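Visualizing the value of the cost function J over (\theta_0, \theta_1) gives a surface plot.
The contour plot is the projection of that surface onto the (\theta_0, \theta_1) plane; the red cross marks the minimum that GD converged to. For linear regression the cost function has only a global optimum and no other local optima, so this is also the optimal parameter value we are looking for.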
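3.2 Linear Regression with Multiple Variables
If each training example is described by several features, we have a linear regression problem with multiple variables. For example, suppose we want to predict the price of a house from its area and its number of bedrooms; each training example is then described by 2 features. The main program is as follows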

%% Initialization

%% ================ Part 1: Feature Normalization ================

%% Clear and Close Figures
clear ; close all; clc

fprintf('Loading data ...\n');

%% Load Data
data = load('ex1data2.txt');
X = data(:, 1:2);
y = data(:, 3);
m = length(y);

% Print out some data points
fprintf('First 10 examples from the dataset: \n');
fprintf(' x = [%.0f %.0f], y = %.0f \n', [X(1:10,:) y(1:10,:)]');

fprintf('Program paused. Press enter to continue.\n');
pause;

% Scale features and set them to zero mean
fprintf('Normalizing Features ...\n');

[X mu sigma] = featureNormalize(X);

% Add intercept term to X
X = [ones(m, 1) X];

%% ================ Part 2: Gradient Descent ================

% ====================== YOUR CODE HERE ======================
% Instructions: We have provided you with the following starter
%               code that runs gradient descent with a particular
%               learning rate (alpha).
%
%               Your task is to first make sure that your functions -
%               computeCost and gradientDescent already work with
%               this starter code and support multiple variables.
%
%               After that, try running gradient descent with
%               different values of alpha and see which one gives
%               you the best result.
%
%               Finally, you should complete the code at the end
%               to predict the price of a 1650 sq-ft, 3 br house.
%
% Hint: By using the 'hold on' command, you can plot multiple
%       graphs on the same figure.
%
% Hint: At prediction, make sure you do the same feature normalization.
%

fprintf('Running gradient descent ...\n');

% Choose some alpha value
alpha = 0.01;
num_iters = 1000;

% Init Theta and Run Gradient Descent
theta = zeros(3, 1);
[theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters);

% Plot the convergence graph
figure;
plot(1:numel(J_history), J_history, '-b', 'LineWidth', 2);
xlabel('Number of iterations');
ylabel('Cost J');

% Display gradient descent's result
fprintf('Theta computed from gradient descent: \n');
fprintf(' %f \n', theta);
fprintf('\n');

% Estimate the price of a 1650 sq-ft, 3 br house
% ====================== YOUR CODE HERE ======================
% Recall that the first column of X is all-ones. Thus, it does
% not need to be normalized.
x_predict = [1 1650 3];
for i = 2:3
    x_predict(i) = (x_predict(i) - mu(i-1)) / sigma(i-1);
end
price = x_predict * theta;

% ============================================================

fprintf(['Predicted price of a 1650 sq-ft, 3 br house ' ...
         '(using gradient descent):\n $%f\n'], price);

fprintf('Program paused. Press enter to continue.\n');
pause;

%% ================ Part 3: Normal Equations ================

fprintf('Solving with normal equations...\n');

% ====================== YOUR CODE HERE ======================
% Instructions: The following code computes the closed form
%               solution for linear regression using the normal
%               equations. You should complete the code in
%               normalEqn.m
%
%               After doing so, you should complete this code
%               to predict the price of a 1650 sq-ft, 3 br house.
%

%% Load Data
data = csvread('ex1data2.txt');
X = data(:, 1:2);
y = data(:, 3);
m = length(y);

% Add intercept term to X
X = [ones(m, 1) X];

% Calculate the parameters from the normal equation
theta = normalEqn(X, y);

% Display normal equation's result
fprintf('Theta computed from the normal equations: \n');
fprintf(' %f \n', theta);
fprintf('\n');

% Estimate the price of a 1650 sq-ft, 3 br house
% ====================== YOUR CODE HERE ======================
x_predict = [1 1650 3];
price = x_predict * theta;

% ============================================================

fprintf(['Predicted price of a 1650 sq-ft, 3 br house ' ...
         '(using normal equations):\n $%f\n'], price);
3.2.1 Feature Normalization
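Looking at the features, the values for house area are roughly 1000 times the values for the number of bedrooms. When the value ranges of different features differ this much, we should first perform feature normalization, which speeds up the convergence of the learning algorithm. To do so, we first compute the mean \mu and standard deviation \sigma of each feature column; the normalized/scaled feature value x' is then related to the original feature value x by x' = (x - \mu) / \sigma. That is, we subtract the mean from the original feature and divide by the standard deviation. This is what the featureNormalize function called by the main program has to implement.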
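The featureNormalize function called by the main program is not listed in the text. A minimal sketch of one possible implementation, following the formula x' = (x - \mu) / \sigma, is given below; the body is an assumption rather than the author's original code, and it does not guard against a constant column (where \sigma would be 0).

```matlab
function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes each column of X to mean 0 and std 1
%   Sketch only: returns the normalized matrix together with the
%   per-column mean and standard deviation used, so that the same
%   scaling can be applied to new inputs at prediction time.

mu = mean(X, 1);          % 1 x n row vector of column means
sigma = std(X, 0, 1);     % 1 x n row vector of column sample std deviations

% Subtract mu from every row, then divide every row by sigma
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);

end
```

Returning mu and sigma matters because a new example such as [1650 3] has to be scaled with the training-set statistics, as the main program does before predicting.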
3.2.2 Gradient Descent
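The cost function and parameter update rule given above for the single-variable case apply equally to the multivariable case; X simply has more columns, and the same code supports it. Note that the cost function J can no longer be visualized over (\theta_0, \theta_1, \theta_2), since that would take four dimensions. We can, however, plot the cost function J against the number of iterations.
Here the learning rate is \alpha = 0.01 with 1000 iterations; the cost function J has almost converged after about 400 iterations and barely changes afterwards. The learning rate \alpha can also be tuned, and choosing a suitable value matters: if it is too small, convergence is slow; if it is too large, gradient descent may fail to converge (each iteration changes the parameters so much that it overshoots the minimum). Ng suggests trying values of \alpha on a log scale, for example dividing by roughly 3 each time: 0.3, 0.1, 0.03, 0.01, ...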
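The main program calls gradientDescentMulti, which is not listed in the text. Below is a minimal vectorized sketch of what such a function might look like; it is an assumption, not the author's code. It uses the same alpha/m update and 1/(2m) cost as the single-variable gradientDescent, and computes the cost inline so that it does not depend on computeCost.

```matlab
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
%GRADIENTDESCENTMULTI Batch gradient descent for any number of features
%   Sketch only: X is m x (n+1) with a leading column of ones,
%   theta is (n+1) x 1.

m = length(y);                       % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    % Simultaneous update of all theta_j using all m examples
    theta = theta - (alpha / m) * (X' * (X * theta - y));

    % Record the cost after this update, as gradientDescent does
    r = X * theta - y;
    J_history(iter) = (r' * r) / (2 * m);
end

end
```

To compare learning rates, one could call the function several times from the same starting point, for example with alpha set to 0.3, 0.1, 0.03 and 0.01, and plot each returned J_history on the same figure using hold on.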
3.2.3 Normal Equations
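Alternatively, we can compute the optimal \theta directly with the formula below. It is derived by taking the derivative of the cost function with respect to the parameter vector \theta and setting the derivative to zero:

    \theta = (X^T X)^{-1} X^T \vec{y}

In code: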

function [theta] = normalEqn(X, y)
%NORMALEQN Computes the closed-form solution to linear regression
%   NORMALEQN(X,y) computes the closed-form solution to linear
%   regression using the normal equations.

theta = zeros(size(X, 2), 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the code to compute the closed form solution
%               to linear regression and put the result in theta.
%

theta = inv(X' * X) * X' * y;

% ============================================================

end
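Forming an explicit inverse works for this exercise, but it is usually neither the fastest nor the most numerically robust way to solve the normal equations. The sketch below compares inv with two common alternatives, the backslash operator and pinv, on a small made-up dataset; all three agree here, and the choice only starts to matter when X' * X is ill-conditioned or singular.

```matlab
% Toy data, made up for illustration: y = 1 + 2*x exactly
X = [1 1; 1 2; 1 3; 1 4];
y = [3; 5; 7; 9];

theta_inv   = inv(X' * X) * X' * y;     % explicit inverse, as in normalEqn
theta_solve = (X' * X) \ (X' * y);      % solve the linear system directly
theta_pinv  = pinv(X) * y;              % pseudo-inverse, tolerates rank deficiency

fprintf('inv:       %f %f\n', theta_inv(1), theta_inv(2));
fprintf('backslash: %f %f\n', theta_solve(1), theta_solve(2));
fprintf('pinv:      %f %f\n', theta_pinv(1), theta_pinv(2));
```

Each variant recovers theta close to [1; 2] on this data.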
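Running the main program prints:

Normalizing Features ...
Running gradient descent ...
Theta computed from gradient descent:
 340397.963535
 109848.008460
 -5866.454085

Predicted price of a 1650 sq-ft, 3 br house (using gradient descent):
 $293237.161479
Program paused. Press enter to continue.
Solving with normal equations...
Theta computed from the normal equations:
 89597.909543
 139.210674
 -8738.019112

Predicted price of a 1650 sq-ft, 3 br house (using normal equations):
 $293081.464335

The two sets of parameters differ because the first was fitted with feature normalization and the second was not. Both predict a price of about $290,000 for the 1650 sq-ft, 3 br house. The remaining gap of about $156 between the two predictions is consistent with gradient descent not having fully converged after 1000 iterations at \alpha = 0.01.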
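To see that feature normalization changes the parameters but not the fitted model, the sketch below solves the least-squares problem exactly on raw and on normalized features and compares the predictions. The four houses and their prices are made up for illustration and are not taken from ex1data2.txt.

```matlab
% Toy housing data, made up for illustration: [area, bedrooms] and price
X_raw = [1000 2; 1500 3; 2000 3; 2500 4];
y = [200000; 280000; 310000; 400000];
m = length(y);
x_new = [1650 3];                          % house to price

% Fit on raw features (exact least-squares solve, no normalization)
theta_raw = [ones(m, 1) X_raw] \ y;
price_raw = [1 x_new] * theta_raw;

% Fit on normalized features
mu = mean(X_raw, 1);
sigma = std(X_raw, 0, 1);
X_norm = bsxfun(@rdivide, bsxfun(@minus, X_raw, mu), sigma);
theta_norm = [ones(m, 1) X_norm] \ y;

% The new example must be scaled with the same mu and sigma
price_norm = [1, (x_new - mu) ./ sigma] * theta_norm;

fprintf('raw: %.2f  normalized: %.2f\n', price_raw, price_norm);
```

The two parameter vectors are very different, yet the predictions agree up to rounding. With centered features the intercept of the normalized fit equals the mean of y, which is a handy sanity check on a gradient descent result.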