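First, the call relationships among the functions.
Main program: train.m
(1) Call sampleIMAGES to extract a number of image patches from the given images.
(2) Call display_network to show a random selection of the extracted patches in a grid.
(3) Gradient check. This step tests whether the cost function is implemented correctly; it can be wrapped in a separate function, checkSparseAutoencoderCost.
① Use sparseAutoencoderCost to compute the network's cost and its gradient.
② Use computeNumericalGradient to compute the gradient numerically (first use checkNumericalGradient to verify that this numerical gradient routine is itself correct).
③ Compare the gradients from ① and ② to decide whether sparseAutoencoderCost is implemented correctly.
Once sparseAutoencoderCost is known to be correct, checkSparseAutoencoderCost does not need to be run during actual training.
(4) Train the network with L-BFGS to obtain the optimized weights and biases.
(5) Visualize the training result.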
Next, each function is listed with annotations.
train.m
%% CS294A/CS294W Programming Assignment Starter Code
addpath ..common\
%%======================================================================
%% STEP 0: Here we provide the relevant parameter values that will
%  allow your sparse autoencoder to get good filters; you do not need to
%  change the parameters below.
visibleSize = 8*8;     % number of input units
hiddenSize = 25;       % number of hidden units
sparsityParam = 0.01;  % desired average activation of the hidden units.
                       % (This was denoted by the Greek letter rho, which looks
                       %  like a lower-case "p", in the lecture notes).
lambda = 0.0001;       % weight decay parameter
beta = 3;              % weight of sparsity penalty term

%%======================================================================
%% STEP 1: Implement sampleIMAGES
%  After implementing sampleIMAGES, the display_network command should
%  display a random sample of 200 patches from the dataset.

% Extract patches from the images; each patch is stored as one column of patches
patches = sampleIMAGES;
% Pick 200 random columns of patches and display the corresponding images
IMG = patches(:, randi(size(patches,2), 200, 1));
display_network(IMG, 8);

%%======================================================================
%% STEP 2 and STEP 3: Implement sparseAutoencoderCost and Gradient Checking
%  This call can be skipped once sparseAutoencoderCost has been verified.
checkSparseAutoencoderCost()

%%======================================================================
%% STEP 4: After verifying that your implementation of sparseAutoencoderCost
%  is correct, train the sparse autoencoder with minFunc (L-BFGS).

% Randomly initialize the parameters
theta = initializeParameters(hiddenSize, visibleSize);

% Use minFunc to minimize the function
addpath minFunc/
options.Method = 'lbfgs'; % Here, we use L-BFGS to optimize our cost function.
                          % Generally, for minFunc to work, you need a function
                          % pointer with two outputs: the function value and the
                          % gradient. In our problem, sparseAutoencoderCost.m
                          % satisfies this.
options.maxIter = 400;    % Maximum number of iterations of L-BFGS to run
options.display = 'on';

% opttheta is the vector holding all weights and biases of the network
[opttheta, cost] = minFunc( @(p) sparseAutoencoderCost(p, ...
                                   visibleSize, hiddenSize, ...
                                   lambda, sparsityParam, ...
                                   beta, patches), ...
                            theta, options);

%%======================================================================
%% STEP 5: Visualization
W1 = reshape(opttheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize); % first-layer weight matrix
display_network(W1', 12);
print -djpeg weights.jpg   % save the visualization to a file
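The reshape in STEP 5 works because the parameter vector stores W1 first. As a minimal sketch, assuming the [W1(:); W2(:); b1(:); b2(:)] layout that sparseAutoencoderCost uses for its gradient, the round trip between matrices and the flat vector can be checked on a tiny network. The sizes and parameter values below are made up for illustration and are not part of the original code.

```matlab
% Illustrative sizes only (the post uses visibleSize = 64, hiddenSize = 25)
visibleSize = 4;
hiddenSize  = 3;

% Hypothetical parameter values, chosen so each block is easy to recognise
W1 = reshape(1:hiddenSize*visibleSize, hiddenSize, visibleSize);
W2 = -reshape(1:visibleSize*hiddenSize, visibleSize, hiddenSize);
b1 = 100 * ones(hiddenSize, 1);
b2 = 200 * ones(visibleSize, 1);

% Pack in the same order as the grad vector returned by sparseAutoencoderCost
theta = [W1(:); W2(:); b1(:); b2(:)];

% Unpack with the same index ranges used inside sparseAutoencoderCost
nW  = hiddenSize * visibleSize;
W1u = reshape(theta(1:nW), hiddenSize, visibleSize);
W2u = reshape(theta(nW+1:2*nW), visibleSize, hiddenSize);
b1u = theta(2*nW+1:2*nW+hiddenSize);
b2u = theta(2*nW+hiddenSize+1:end);

% The round trip should reproduce every block exactly
roundTripOK = isequal(W1u, W1) && isequal(W2u, W2) && ...
              isequal(b1u, b1) && isequal(b2u, b2);
disp(roundTripOK);
```

Each row of W1 holds the input weights of one hidden unit, which is why the transpose W1' is passed to display_network: it shows one image per column.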
checkSparseAutoencoderCost.m
%% This function checks whether sparseAutoencoderCost is implemented correctly
function checkSparseAutoencoderCost()

%% Build a sparse autoencoder (same settings as the main program, or a new set)
visibleSize = 8*8;     % number of input units
hiddenSize = 25;       % number of hidden units
sparsityParam = 0.01;  % desired average activation of the hidden units.
                       % (This was denoted by the Greek letter rho, which looks
                       %  like a lower-case "p", in the lecture notes).
lambda = 0.0001;       % weight decay parameter
beta = 3;              % weight of sparsity penalty term

patches = sampleIMAGES;

% Obtain random parameters theta
theta = initializeParameters(hiddenSize, visibleSize);

%% Compute the cost and the analytic gradient (on the first 10 patches only)
[cost, grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, lambda, ...
                                     sparsityParam, beta, patches(:,1:10));

%% Compute the gradient numerically (this calls the autoencoder cost function)
numgrad = computeNumericalGradient( @(x) sparseAutoencoderCost(x, visibleSize, ...
                                                  hiddenSize, lambda, ...
                                                  sparsityParam, beta, ...
                                                  patches(:,1:10)), theta);

%% Compare the gradient from the cost function with the numerical approximation
% Use this to visually compare the gradients side by side
disp([numgrad grad]);

% Compare numerically computed gradients with the ones obtained from backpropagation
diff = norm(numgrad-grad)/norm(numgrad+grad);
disp(diff); % Should be small. In our implementation, these values are usually less than 1e-9.

end
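The same comparison can be tried on a function whose gradient is known in closed form. The following is a minimal sketch, assuming a centered-difference approximation with a step of 1e-4; the toy cost J, its gradient gradJ, and the name relDiff are hypothetical, and this is not the computeNumericalGradient from the other post.

```matlab
% Toy cost with a known analytic gradient (hypothetical, for illustration only)
J     = @(x) x(1)^2 + 3*x(1)*x(2);
gradJ = @(x) [2*x(1) + 3*x(2); 3*x(1)];

x       = [4; 10];
epsilon = 1e-4;                 % assumed step size
numgrad = zeros(size(x));

% Centered difference: perturb one coordinate at a time
for i = 1:numel(x)
    e    = zeros(size(x));
    e(i) = epsilon;
    numgrad(i) = (J(x + e) - J(x - e)) / (2*epsilon);
end

grad = gradJ(x);                % analytic gradient at x

% Same relative-difference measure as in checkSparseAutoencoderCost
relDiff = norm(numgrad - grad) / norm(numgrad + grad);
disp([numgrad grad]);
disp(relDiff);
```

The sketch stores the result in relDiff rather than diff so that MATLAB's built-in diff function is not shadowed. With a correct analytic gradient the two columns should agree to many digits and relDiff should be very small.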
sparseAutoencoderCost.m
%% Compute the network's cost function and gradient
function [cost,grad] = sparseAutoencoderCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)
% visibleSize: the number of input units (probably 64)
% hiddenSize: the number of hidden units (probably 25)
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                notes by the Greek letter rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our 64x10000 matrix containing the training data. So, data(:,i) is the i-th training example.

% The input theta is a vector (because minFunc expects the parameters to be a vector).
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this
% follows the notation convention of the lecture notes.
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values).
% Here, we initialize them to zeros.
cost = 0;
m = size(data,2);   % number of training examples

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc. Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1. I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b)
% with respect to the input parameter W1(i,j). Thus, W1grad should be equal to the term
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
%
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2.
%

%% Forward propagation
a1 = data;
z2 = bsxfun(@plus, W1*a1, b1);
a2 = sigmoid(z2);
z3 = bsxfun(@plus, W2*a2, b2);
a3 = sigmoid(z3);

%% Compute the network error
% Error term J1 = mean of the per-example cost over all examples
y = data;                   % target output of the network (the input itself)
Ei = sum((a3-y).^2)/2;      % cost of each example (1 x m row vector)
J1 = sum(Ei)/m;
% Regularization term J2 = sum of squares of all weights
J2 = sum(W1(:).^2) + sum(W2(:).^2);
% Sparsity term J3 = sum of the KL divergences over all hidden units
rho_hat = sum(a2,2)/m;      % average activation of each hidden unit
KL = sum(sparsityParam*log(sparsityParam./rho_hat) + ...
         (1-sparsityParam)*log((1-sparsityParam)./(1-rho_hat)));
J3 = KL;
% Total cost of the network
cost = J1 + lambda*J2/2 + beta*J3;

%% Backpropagation: compute the error term (delta) of each layer
delta3 = -(data-a3).*dsigmoid(z3);
spare_delta = beta*(-sparsityParam./rho_hat + (1-sparsityParam)./(1-rho_hat));
delta2 = bsxfun(@plus, W2'*delta3, spare_delta).*dsigmoid(z2); % the sparsity term is added here

%% Gradients of the cost with respect to each layer's weights and biases
W1grad = delta2*a1'/m + lambda*W1;
W2grad = delta3*a2'/m + lambda*W2;
b1grad = sum(delta2,2)/m;
b2grad = sum(delta3,2)/m;

%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc). Specifically, we will unroll
% your gradient matrices into a vector.
grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];

end   % required because the local functions below are also closed with end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients. This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)).
function sigm = sigmoid(x)
    sigm = 1 ./ (1 + exp(-x));
end

%% Derivative of the sigmoid function. Be careful with this formula (I got it
%  wrong once): the input x is the pre-activation z, not the activation a.
function dsigm = dsigmoid(x)
    sigx = sigmoid(x);
    dsigm = sigx.*(1-sigx);
end
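For reference, the quantities computed in sparseAutoencoderCost can be written out as formulas. This is a transcription of the code above rather than a quotation from the lecture notes; the symbols are notational assumptions: x^{(i)} is the i-th column of data, f is the sigmoid, s_2 is the number of hidden units, \rho is sparsityParam, and \odot is the element-wise product.

```latex
\begin{aligned}
\hat\rho_j &= \frac{1}{m}\sum_{i=1}^{m} a^{(2)}_j\!\left(x^{(i)}\right) \\
J(W,b) &= \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left\|a^{(3)}\!\left(x^{(i)}\right)-x^{(i)}\right\|^2
        + \frac{\lambda}{2}\Big(\sum_{k,l}\big(W^{(1)}_{kl}\big)^2+\sum_{k,l}\big(W^{(2)}_{kl}\big)^2\Big) \\
       &\quad + \beta\sum_{j=1}^{s_2}\Big[\rho\log\frac{\rho}{\hat\rho_j}+(1-\rho)\log\frac{1-\rho}{1-\hat\rho_j}\Big] \\
f'(z) &= f(z)\,\big(1-f(z)\big) \\
\delta^{(3)} &= -\big(x-a^{(3)}\big)\odot f'\!\big(z^{(3)}\big) \\
\delta^{(2)} &= \Big(\big(W^{(2)}\big)^{\!\top}\delta^{(3)}
        + \beta\Big(-\frac{\rho}{\hat\rho}+\frac{1-\rho}{1-\hat\rho}\Big)\Big)\odot f'\!\big(z^{(2)}\big) \\
\nabla_{W^{(1)}}J &= \frac{1}{m}\sum_{i=1}^{m}\delta^{(2,i)}\big(a^{(1,i)}\big)^{\!\top}+\lambda W^{(1)},
\qquad \nabla_{b^{(1)}}J = \frac{1}{m}\sum_{i=1}^{m}\delta^{(2,i)} \\
\nabla_{W^{(2)}}J &= \frac{1}{m}\sum_{i=1}^{m}\delta^{(3,i)}\big(a^{(2,i)}\big)^{\!\top}+\lambda W^{(2)},
\qquad \nabla_{b^{(2)}}J = \frac{1}{m}\sum_{i=1}^{m}\delta^{(3,i)}
\end{aligned}
```

Here \delta^{(l,i)} denotes the error term of layer l for example i. Note that f' is evaluated at the pre-activations z^{(2)} and z^{(3)}, which matches the calls dsigmoid(z2) and dsigmoid(z3) in the code.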
The gradient check function is covered in a separate post.