UFLDL Deep Learning Primer, Part 5: Linear Decoders

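This section is actually quite simple. A linear decoder just replaces the activation function of the output layer with a linear (identity) function; nothing else changes.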

The point of replacing the sigmoid with a linear function at the output layer is that the output is no longer confined to the [0,1] range.
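To make the range restriction concrete, here is a minimal sketch comparing the two output activations. The pre-activation values in z are made up for illustration; the point is only that the sigmoid squashes everything into (0,1), while the identity passes negative and large values through unchanged, which matters once the inputs are whitened and no longer lie in [0,1].

```matlab
% Illustrative comparison of a sigmoid output unit and a linear output unit.
z = [-3 -0.5 0 2 10];               % hypothetical pre-activations of the output layer
sigmoidOut = 1 ./ (1 + exp(-z));    % always strictly between 0 and 1
linearOut  = z;                     % identity: any real value can be produced
disp([z; sigmoidOut; linearOut]);   % one row per quantity, for side-by-side reading
```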

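The dataset consists of STL-10 color images, but as 8×8 patches rather than the full 64×64 images.

Load the dataset with load stlSampledPatches. This gives the matrix patches of size 192×100000, where 192 = 8×8×3.

The full 64×64 STL-10 images are used as input later, in the convolution and pooling exercise.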


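inputSize = 8×8×3 = 192, where 3 is the number of color channels.

The hidden layer has 400 units, with lambda = 3e-3, beta = 5, and a sparsity target (sparsityParam) of 0.035. The input data is preprocessed with ZCA whitening, using epsilon = 0.1.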
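For reference, the settings above could be written down as a short setup script. The variable names patchDim, imageChannels, numPatches, visibleSize and hiddenSize are my assumption of how the exercise script names them; only the numeric values come from the text above.

```matlab
% Parameter setup sketch; variable names are assumed, values are from the text.
imageChannels = 3;        % RGB
patchDim      = 8;        % patches are 8x8
numPatches    = 100000;   % number of columns in the patches matrix

visibleSize = patchDim * patchDim * imageChannels;  % 192 input units
hiddenSize  = 400;        % hidden units

sparsityParam = 0.035;    % desired average activation of the hidden units
lambda        = 3e-3;     % weight decay parameter
beta          = 5;        % weight of the sparsity penalty term
epsilon       = 0.1;      % regularization constant for ZCA whitening
```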


STEP 1: Create and modify sparseAutoencoderLinearCost.m to use a linear decoder

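This step starts from the earlier sparseAutoencoderCost.m and changes the activation function of the last layer. Andrew Ng has a good habit of verifying every step as soon as it is written, and this step is no exception: the gradient is checked against a numerical approximation. The debug run uses a tiny network (8 input units, 5 hidden units) so that the check finishes quickly.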
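The idea behind the numerical check can be sketched on a toy objective. This is not the exercise's computeNumericalGradient.m, just a minimal central-difference version under my own assumptions: J is a made-up two-parameter cost with a known gradient, and EPS is the step size.

```matlab
% Central-difference gradient check on a toy objective (hypothetical stand-in
% for the autoencoder cost).
J = @(t) t(1)^2 + 3*t(1)*t(2);                   % toy cost
analyticGrad = @(t) [2*t(1) + 3*t(2); 3*t(1)];   % its exact gradient

theta = [4; 10];
EPS = 1e-4;                                      % perturbation size
numGrad = zeros(size(theta));
for i = 1:numel(theta)
    e = zeros(size(theta));
    e(i) = EPS;                                  % perturb one parameter at a time
    numGrad(i) = (J(theta + e) - J(theta - e)) / (2*EPS);
end

grad = analyticGrad(theta);
relDiff = norm(numGrad - grad) / norm(numGrad + grad);
disp(relDiff);                                   % should be very small
```

The loop calls the cost function twice per parameter, which is why the debug network is kept so small: with 192 inputs and 400 hidden units the same check would need several hundred thousand cost evaluations.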

The code below is essentially the same as sparseAutoencoderCost.m. Only two statements changed, and I marked them with %@@ comments.

function [cost,grad] = sparseAutoencoderLinearCost(theta, visibleSize, hiddenSize, ...
                                             lambda, sparsityParam, beta, data)
% -------------------- YOUR CODE HERE --------------------
% Instructions:
%   Copy sparseAutoencoderCost in sparseAutoencoderCost.m from your
%   earlier exercise onto this file, renaming the function to
%   sparseAutoencoderLinearCost, and changing the autoencoder to use a
%   linear decoder.
% -------------------- YOUR CODE HERE --------------------                                    



% visibleSize: the number of input units (192 in this exercise) 
% hiddenSize: the number of hidden units (400 in this exercise) 
% lambda: weight decay parameter
% sparsityParam: The desired average activation for the hidden units (denoted in the lecture
%                           notes by the greek alphabet rho, which looks like a lower-case "p").
% beta: weight of sparsity penalty term
% data: Our visibleSize x m matrix containing the training data (192x100000 in this exercise).  So, data(:,i) is the i-th training example. 
  
% The input theta is a vector (because minFunc expects the parameters to be a vector). 
% We first convert theta to the (W1, W2, b1, b2) matrix/vector format, so that this 
% follows the notation convention of the lecture notes. 

%convert theta to original matrix. 
%W1 is hiddenSize x visibleSize, W2 is visibleSize x hiddenSize, b1 is hiddenSize x 1, b2 is visibleSize x 1
W1 = reshape(theta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
W2 = reshape(theta(hiddenSize*visibleSize+1:2*hiddenSize*visibleSize), visibleSize, hiddenSize);
b1 = theta(2*hiddenSize*visibleSize+1:2*hiddenSize*visibleSize+hiddenSize);
b2 = theta(2*hiddenSize*visibleSize+hiddenSize+1:end);

% Cost and gradient variables (your code needs to compute these values). 
% Here, we initialize them to zeros. 
cost = 0;
W1grad = zeros(size(W1)); 
W2grad = zeros(size(W2));
b1grad = zeros(size(b1)); 
b2grad = zeros(size(b2));

%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost/optimization objective J_sparse(W,b) for the Sparse Autoencoder,
%                and the corresponding gradients W1grad, W2grad, b1grad, b2grad.
%
% W1grad, W2grad, b1grad and b2grad should be computed using backpropagation.
% Note that W1grad has the same dimensions as W1, b1grad has the same dimensions
% as b1, etc.  Your code should set W1grad to be the partial derivative of J_sparse(W,b) with
% respect to W1.  I.e., W1grad(i,j) should be the partial derivative of J_sparse(W,b) 
% with respect to the input parameter W1(i,j).  Thus, W1grad should be equal to the term 
% [(1/m) \Delta W^{(1)} + \lambda W^{(1)}] in the last block of pseudo-code in Section 2.2 
% of the lecture notes (and similarly for W2grad, b1grad, b2grad).
% 
% Stated differently, if we were using batch gradient descent to optimize the parameters,
% the gradient descent update to W1 would be W1 := W1 - alpha * W1grad, and similarly for W2, b1, b2. 
% 

%1.forward propagation
data_size=size(data);
active_value2=repmat(b1,1,data_size(2));
active_value3=repmat(b2,1,data_size(2));
active_value2=sigmoid(W1*data+active_value2);
%active_value3=sigmoid(W2*active_value2+active_value3);
active_value3=(W2*active_value2+active_value3); %@@@@ here we use linear function
%2.computing error term and cost
ave_square=sum(sum((active_value3-data).^2)./2)/data_size(2);
weight_decay=lambda/2*(sum(sum(W1.^2))+sum(sum(W2.^2)));

p_real=sum(active_value2,2)./data_size(2);
p_para=repmat(sparsityParam,hiddenSize,1);
sparsity=beta.*sum(p_para.*log(p_para./p_real)+(1-p_para).*log((1-p_para)./(1-p_real)));
cost=ave_square+weight_decay+sparsity;
        % here compute the cost, in order to use in
        % computeNumericalGradient function

%delta3=(active_value3-data).*(active_value3).*(1-active_value3);
delta3=(active_value3-data);     %@@ here we simplified @@
average_sparsity=repmat(sum(active_value2,2)./data_size(2),1,data_size(2));
default_sparsity=repmat(sparsityParam,hiddenSize,data_size(2));
sparsity_penalty=beta.*(-(default_sparsity./average_sparsity)+((1-default_sparsity)./(1-average_sparsity)));
delta2=(W2'*delta3+sparsity_penalty).*((active_value2).*(1-active_value2));
%3.backward propagation
W2grad=delta3*active_value2'./data_size(2)+lambda.*W2;
W1grad=delta2*data'./data_size(2)+lambda.*W1;
b2grad=sum(delta3,2)./data_size(2);
b1grad=sum(delta2,2)./data_size(2);

%-------------------------------------------------------------------
% After computing the cost and gradient, we will convert the gradients back
% to a vector format (suitable for minFunc).  Specifically, we will unroll
% your gradient matrices into a vector.

grad = [W1grad(:) ; W2grad(:) ; b1grad(:) ; b2grad(:)];   % vector, size = 2*hiddenSize*visibleSize + hiddenSize + visibleSize

end

%-------------------------------------------------------------------
% Here's an implementation of the sigmoid function, which you may find useful
% in your computation of the costs and the gradients.  This inputs a (row or
% column) vector (say (z1, z2, z3)) and returns (f(z1), f(z2), f(z3)). 

function sigm = sigmoid(x)
  
    sigm = 1 ./ (1 + exp(-x));
end
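The two lines marked %@@ follow directly from the backpropagation formulas once the output activation is the identity. As a short derivation, using the lecture-notes notation and writing y for the target (here the input itself), \rho for sparsityParam and \hat\rho for the average hidden activation:

```latex
\begin{aligned}
a^{(3)} &= z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}, \qquad f'(z^{(3)}) = 1 \\
\delta^{(3)} &= -\,(y - a^{(3)}) \odot f'(z^{(3)}) = a^{(3)} - y \\
\delta^{(2)} &= \left( (W^{(2)})^{T} \delta^{(3)}
  + \beta \left( -\frac{\rho}{\hat\rho} + \frac{1-\rho}{1-\hat\rho} \right) \right)
  \odot a^{(2)} \odot (1 - a^{(2)})
\end{aligned}
```

With a sigmoid output, f'(z^{(3)}) would be a^{(3)} \odot (1 - a^{(3)}), which is exactly the factor that was dropped from the commented-out delta3 line. The hidden layer still uses the sigmoid, so delta2 keeps its derivative term.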

 
 

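Nothing after this point needs to change; the rest of the exercise code can be used as-is.

The gradient check contains the following three statements. They compute the relative difference between the numerical gradient and the analytic gradient, print it, and abort if it is too large: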

diff = norm(numGrad-grad)/norm(numGrad+grad);
% Should be small. In our implementation, these values are usually less than 1e-9.
disp(diff); 

assert(diff < 1e-9, 'Difference too large. Check your gradient computation again');
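To get a feel for what this relative difference looks like in practice, here is a small sketch with made-up gradient vectors. All three vectors are hypothetical: goodGrad differs from numGrad only by tiny noise, while badGrad has one component off by a factor of two, the kind of error a wrong delta term would produce.

```matlab
% Relative-difference metric on hypothetical gradients.
numGrad  = [0.5; -1.2; 3.0];                  % pretend numerical gradient
goodGrad = numGrad + 1e-11 * [1; -1; 1];      % agrees up to tiny noise
badGrad  = numGrad .* [1; 1; 0.5];            % third component is wrong

relGood = norm(numGrad - goodGrad) / norm(numGrad + goodGrad);
relBad  = norm(numGrad - badGrad)  / norm(numGrad + badGrad);

disp(relGood);   % passes the 1e-9 threshold
disp(relBad);    % far above the threshold, so the assert would fire
```

Dividing by norm(numGrad+grad) makes the measure independent of the overall scale of the gradient, so the same 1e-9 threshold can be reused across different network sizes.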

STEP 2: Learn features on small patches

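Next, load the dataset: load stlSampledPatches.mat

This gives the matrix patches of size 192×100000, where 192 = 8×8×3 because the images are in color.

The data is then ZCA-whitened, which is a good excuse to review some linear algebra.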

% Apply ZCA whitening
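sigma = patches * patches' / numPatches; %@@ covariance matrix (patches are already zero-mean at this point)
[u, s, v] = svd(sigma); %@@ columns of u are the eigenvectors, s is the diagonal matrix of eigenvalues
ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u'; %@@ epsilon keeps the scaling stable when an eigenvalue is close to zero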
patches = ZCAWhite * patches;
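A quick way to see what the whitening matrix does is to apply the same three lines to small synthetic data and look at the covariance afterwards. The data below is made up (a fixed mixing matrix A applied to random noise); the whitening code itself is the same as above.

```matlab
% ZCA whitening on synthetic, correlated 3-D data (hypothetical stand-in for patches).
A = [2 0 0; 1 1 0; 0.5 0.3 0.2];
x = A * randn(3, 1000);
x = bsxfun(@minus, x, mean(x, 2));     % zero-mean, as whitening assumes
n = size(x, 2);

epsilon = 0.1;                          % same value as in the exercise
sigma = x * x' / n;                     % covariance matrix
[u, s, v] = svd(sigma);
ZCAWhite = u * diag(1 ./ sqrt(diag(s) + epsilon)) * u';
xWhite = ZCAWhite * x;

% After whitening, the covariance has the same eigenvectors u, with each
% eigenvalue s_i replaced by s_i / (s_i + epsilon).
covWhite = xWhite * xWhite' / n;
expected = u * diag(diag(s) ./ (diag(s) + epsilon)) * u';
disp(norm(covWhite - expected, 'fro'));
```

So the whitened covariance is close to the identity only in directions where the eigenvalue is large compared with epsilon; low-variance directions are damped rather than amplified.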

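Finally, the exercise suggests saving some of the results of training: save('STL10Features.mat', 'optTheta', 'ZCAWhite', 'meanPatch');

STL10Features.mat contains:

optTheta: the trained parameters W and b.

ZCAWhite: the ZCA whitening matrix, computed from the patches after their mean has been subtracted. Whitening decorrelates the inputs and scales them to roughly unit variance. To be honest, I don't fully understand why ZCA whitening is needed here; readers are welcome to share their views in the comments.

meanPatch: the mean patch, that is, the 192×1 column vector of per-pixel means that is subtracted from the raw patches to make them zero-mean. It is not the mean-subtracted data itself.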
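The reason all three are saved together is that any new patch has to go through the same preprocessing before the learned weights can be applied to it. A minimal sketch, under my own assumptions: the sizes are shrunk so it runs instantly, and optTheta, ZCAWhite and meanPatch are random or trivial placeholders rather than the real contents of STL10Features.mat. The unpacking of W1 and b1 uses the same layout as the cost function above.

```matlab
% Toy sizes so the sketch runs quickly; the exercise uses 192 and 400.
visibleSize = 6;
hiddenSize  = 4;

% Hypothetical placeholders for the contents of STL10Features.mat.
optTheta  = 0.1 * randn(2*hiddenSize*visibleSize + hiddenSize + visibleSize, 1);
ZCAWhite  = eye(visibleSize);
meanPatch = 0.5 * ones(visibleSize, 1);

% Unpack the encoder parameters with the same layout as in the cost function.
W1 = reshape(optTheta(1:hiddenSize*visibleSize), hiddenSize, visibleSize);
b1 = optTheta(2*hiddenSize*visibleSize+1 : 2*hiddenSize*visibleSize+hiddenSize);

% Preprocess a new raw patch exactly like the training patches, then encode it.
newPatch = rand(visibleSize, 1);
whitened = ZCAWhite * (newPatch - meanPatch);
features = 1 ./ (1 + exp(-(W1 * whitened + b1)));   % hidden-layer activations
disp(size(features));
```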

The mean subtraction is implemented as follows:

meanPatch = mean(patches, 2); % mean of each row (one pixel position across all patches), returned as a column vector
patches = bsxfun(@minus, patches, meanPatch);
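On a tiny made-up matrix the effect of these two lines is easy to verify by hand. In this sketch each row is one pixel position and each column is one patch, as in the exercise; the numbers are invented.

```matlab
% Mean subtraction on hypothetical toy data: 2 pixel positions x 3 patches.
toyPatches = [1 2 3; 10 20 30];
toyMean = mean(toyPatches, 2);                       % column vector of row means
centered = bsxfun(@minus, toyPatches, toyMean);      % subtract it from every column
disp(centered);
disp(mean(centered, 2));                             % every row now averages to zero
```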
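
PS: I don't agree with this way of mean-normalizing images. As an example, suppose I am classifying houses, and I have three houses with the feature matrix shown below. For features such as floor area, zero-centering and normalizing is entirely appropriate, because the three features differ greatly in magnitude. But zero-centering each pixel position across many images makes no sense to me. A single pixel carries no meaning on its own; its meaning lies in how it combines with the surrounding pixels, and that is what an image conveys. Subtracting the mean pixel value across images, in my view, disrupts the local structure of the image and does not match how humans recognize images.

[150  200  100    % floor area of the 3 houses
 4    5    3      % number of rooms of the 3 houses
 200  400  120    % sale price of the 3 houses
]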
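For the house example, the kind of normalization I have in mind can be sketched as below. It is only an illustration of per-feature standardization (zero mean and unit standard deviation for each row); the variable names are made up.

```matlab
% Per-feature standardization of the house matrix (rows = features, columns = houses).
X = [150 200 100;     % floor area
     4   5   3;       % number of rooms
     200 400 120];    % sale price

mu = mean(X, 2);                  % mean of each feature
sd = std(X, 0, 2);                % standard deviation of each feature
Xnorm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sd);
disp(Xnorm);                      % all three rows are now on a comparable scale
```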












