In the previous article, for ease of demonstration we set a different learning rate for each component of $\theta = [\theta_0, \theta_1]$. With many parameters (as in a neural network), hand-tuning a separate learning rate for every parameter is tedious and impractical, yet a single shared learning rate does not suit all parameters either: in the previous article, $\eta_0 = 0.1$ was far larger than $\eta_1 = 0.002/0.005$.
So, for the parameter vector $\theta$: can the learning rate adapt itself to the different directions of parameter space?
AdaGrad adjusts the learning rate dynamically during training, giving each parameter its own learning rate based on the accumulated sum of squared gradients. Its update rule is:
$$s := s + \nabla_\theta J(\theta) \odot \nabla_\theta J(\theta)$$
$$\theta := \theta - \frac{\eta}{\sqrt{s + \epsilon}} \odot \nabla_\theta J(\theta)$$
where $\odot$ is the element-wise product, so the first line accumulates the square of each parameter's gradient. $\epsilon$ is a tiny term that prevents division by zero and keeps the computation numerically stable; the code below uses $10^{-7}$.
Because $s$ is a running sum of squared gradients, it only grows, which matches the intuition that a smaller learning rate is needed in later iterations as we approach the optimum.
Pros: every parameter proceeds at its own pace.
Cons: since the learning rate decays monotonically, decaying too fast early on can leave too little drive in later iterations, so AdaGrad may fail to reach a satisfactory result.
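To make the per-parameter adaptation and the decay concrete, here is a minimal standalone sketch (separate from the full script below) that runs the AdaGrad update on a made-up quadratic loss; the toy loss and all numbers are assumptions chosen purely for illustration.
% Minimal AdaGrad sketch on a toy quadratic loss with minimum at [10; 0.5]
% (illustrative only; not the linear-regression script used below)
theta = [-30; -4];                 % initial parameters
eta   = 8;                         % global learning rate
eps_  = 1e-7;                      % numerical stability term
s     = zeros(2,1);                % accumulated squared gradients
for t = 1:50
    grad  = theta - [10; 0.5];                    % gradient of the toy quadratic
    s     = s + grad.*grad;                       % accumulate squared gradient
    theta = theta - eta./sqrt(s+eps_).*grad;      % per-parameter adaptive step
end
disp(eta./sqrt(s+eps_))            % effective step sizes; they keep shrinking as s grows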
The figure below shows the iterations of mini-batch gradient descent with AdaGrad, using learning rate = 8, batchSize = 32, iterations = 50.
Below is the corresponding MATLAB code for AdaGrad:
% Written by weichen GU, date 4/23/2020
clear, clf, clc;
data = linspace(-20,20,100); % x range
col = length(data); % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10]; % Generate dataset: y = 0.5*x + white Gaussian noise + 10
X = [ones(1, col); data(1,:)]'; % X ->[1;X];
t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);
theta =[-30;-4]; % Initialize parameters
LRate = 8; % Learning rate
thresh = 0.5; % Threshold of loss for jumping iteration
iteration = 50; % The number of iterations
lineX = linspace(-30,30,100);
[row, col] = size(data); % Obtain the size of dataset
lineMy = [lineX;theta(1)+theta(2)*lineX]; % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
loss = getLoss(X,data(2,:)',col,theta); % Obtain current loss value
subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;
% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');
% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')
% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using AdaGrad');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');
set(gca,'XLim',[0 iteration]);
%set(gca,'YLim',[0 4000]);
hold on;
batchSize = 32;
s = 0;
for iter = 1 : iteration
delete(hLine) % set(hLine,'visible','off')
%[thetaOut] = GD(X,data(2,:)',theta,LRate); % Gradient Descent algorithm
[thetaOut,s] = MBGD(X,data(2,:)',theta,LRate,batchSize,s);
subplot(2,2,3);
line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')
theta = thetaOut;
loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
subplot(2,2,1);
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
%legend('training data','linear regression');
subplot(2,2,2);
scatter3(theta(1),theta(2),loss,'r*');
subplot(2,2,3);
plot(theta(1),theta(2),'r*')
subplot(2,2,4)
plot(iter,loss,'b*');
drawnow();
if(loss < thresh)
break;
end
end
hold off
function [Z] = getTotalCost(X,Y, num,meshX,meshY);
[row,col] = size(meshX);
Z = zeros(row, col);
for i = 1 : row
theta = [meshX(i,:); meshY(i,:)];
Z(i,:) = 1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);
end
end
function [Z] = getLoss(X,Y, num,theta)
Z= 1/(2*num)*sum((X*theta-Y).^2);
end
function [thetaOut] = GD(X,Y,theta,eta)
dataSize = length(X); % Obtain the number of data
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
thetaOut = theta -eta.*dx; % Update parameters(theta)
end
function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = s + dx.*dx;
thetaOut = theta -eta./sqrt(eps+s).*dx; % Update parameters(theta)
end
% @ Description:
% Mini-batch Gradient Descent (MBGD)
% Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
% X - [1 X_] X_ is actual X; Y - actual Y
% theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
% eta - learning rate;
%
function [thetaOut,s] = MBGD(X,Y,theta, eta,batchSize,s)
dataSize = length(X); % obtain the number of data
k = fix(dataSize/batchSize); % number of full-size batches
batchIdx = randperm(dataSize); % shuffle indices every epoch for sample diversity
batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize); % indices of the full-size batches
batchIdx2 = batchIdx(k*batchSize+1:end); % indices of the remaining (smaller) batch
for i = 1 : k
%thetaOut = GD(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta);
[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
end
if(~isempty(batchIdx2))
%thetaOut = GD(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta);
[thetaOut,s] = AdaGrad(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s);
end
end
To counter AdaGrad's overly fast learning-rate decay, RMSProp replaces the accumulated sum of squared gradients with an exponentially weighted moving average (accumulating local gradient information), so that gradients far from the current point contribute little. Its update rule is:
$$s := \beta \cdot s + (1-\beta)\cdot \nabla_\theta J(\theta) \odot \nabla_\theta J(\theta)$$
$$\theta := \theta - \frac{\eta}{\sqrt{s + \epsilon}} \odot \nabla_\theta J(\theta)$$
where $\beta$ is the RMSProp decay factor (we use $\beta = 0.99$ below), $s$ is the exponentially weighted moving average of the squared gradient, initialized to 0, and $\odot$ is the element-wise product. Common default hyperparameters are $\beta = 0.9$ and $\eta = 0.001$.
Pros: the decay factor added on top of AdaGrad balances past and current gradient information when updating the learning rate, which softens the sharp drop caused by unbounded gradient accumulation and keeps training from stalling too early.
Cons: it introduces the extra hyperparameter $\beta$, adding model complexity, and it still depends on a global learning rate $\eta$.
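As a rough illustration of how the two accumulators behave, the sketch below (toy numbers chosen only for illustration) feeds a constant gradient into both: the AdaGrad sum grows without bound, so its step keeps shrinking, while the RMSProp moving average levels off near the squared gradient.
% Toy comparison of the two accumulators under a constant gradient g = 2
g = 2; eta = 0.5; eps_ = 1e-7; beta = 0.9;
s_ada = 0; s_rms = 0;
for t = 1:100
    s_ada = s_ada + g^2;                 % AdaGrad: unbounded accumulation
    s_rms = beta*s_rms + (1-beta)*g^2;   % RMSProp: exponentially weighted average
end
fprintf('AdaGrad step: %.4f\n', eta/sqrt(s_ada+eps_)); % shrinks like eta/(|g|*sqrt(t))
fprintf('RMSProp step: %.4f\n', eta/sqrt(s_rms+eps_)); % settles near eta/|g|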
Here we use a smaller learning rate $\eta = 0.5$ to compare the convergence of AdaGrad and RMSProp:
($\eta = 0.5$)
The comparison shows that RMSProp softens the drop in learning rate caused by unbounded gradient accumulation and keeps training from ending too early.
Why does oscillation appear in the $\theta_1$ direction here? It is the combined effect of a learning rate that is too large in the $\theta_1$ direction (the main factor) and a decay factor that does not constrain it enough (a secondary factor).
Below is the MATLAB code for RMSProp:
% Written by weichen GU, date 4/23/2020
clear, clf, clc;
data = linspace(-20,20,100); % x range
col = length(data); % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10]; % Generate dataset: y = 0.5*x + white Gaussian noise + 10
X = [ones(1, col); data(1,:)]'; % X ->[1;X];
t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);
theta =[-30;-4]; % Initialize parameters
LRate = 0.5; % Learning rate
thresh = 0.5; % Threshold of loss for jumping iteration
iteration = 50; % The number of iterations
lineX = linspace(-30,30,100);
[row, col] = size(data); % Obtain the size of dataset
lineMy = [lineX;theta(1)+theta(2)*lineX]; % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
loss = getLoss(X,data(2,:)',col,theta); % Obtain current loss value
subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;
% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');
% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')
% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using RMSProp');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');
set(gca,'XLim',[0 iteration]);
%set(gca,'YLim',[0 4000]);
hold on;
batchSize = 32;
s = 0;
beta = 0.99;
for iter = 1 : iteration
delete(hLine) % set(hLine,'visible','off')
[thetaOut,s] = MBGD(X,data(2,:)',theta,LRate,batchSize,s,beta);
subplot(2,2,3);
line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')
theta = thetaOut;
loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
subplot(2,2,1);
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
%legend('training data','linear regression');
subplot(2,2,2);
scatter3(theta(1),theta(2),loss,'r*');
subplot(2,2,3);
plot(theta(1),theta(2),'r*')
subplot(2,2,4)
plot(iter,loss,'b*');
drawnow();
if(loss < thresh)
break;
end
end
hold off
function [Z] = getTotalCost(X,Y, num,meshX,meshY);
[row,col] = size(meshX);
Z = zeros(row, col);
for i = 1 : row
theta = [meshX(i,:); meshY(i,:)];
Z(i,:) = 1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);
end
end
function [Z] = getLoss(X,Y, num,theta)
Z= 1/(2*num)*sum((X*theta-Y).^2);
end
function [thetaOut] = GD(X,Y,theta,eta)
dataSize = length(X); % Obtain the number of data
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
thetaOut = theta -eta.*dx; % Update parameters(theta)
end
function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = s + dx.*dx;
thetaOut = theta -eta./sqrt(eps+s).*dx; % Update parameters(theta)
end
function [thetaOut,s] = RMSProp(X,Y,theta,eta,s,beta)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = beta*s + (1-beta)*dx.*dx;
thetaOut = theta -eta./sqrt(eps+s).*dx; % Update parameters(theta)
end
% @ Description:
% Mini-batch Gradient Descent (MBGD)
% Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
% X - [1 X_] X_ is actual X; Y - actual Y
% theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
% eta - learning rate;
%
function [thetaOut,s] = MBGD(X,Y,theta, eta,batchSize,s,beta)
dataSize = length(X); % obtain the number of data
k = fix(dataSize/batchSize); % number of full-size batches
batchIdx = randperm(dataSize); % shuffle indices every epoch for sample diversity
batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize); % indices of the full-size batches
batchIdx2 = batchIdx(k*batchSize+1:end); % indices of the remaining (smaller) batch
for i = 1 : k
%[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
[thetaOut,s] = RMSProp(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s,beta);
end
if(~isempty(batchIdx2))
%[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
[thetaOut,s] = RMSProp(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s,beta);
end
end
AdaDelta is another refinement of AdaGrad. Compared with RMSProp, it replaces the global learning rate $\eta$ with an exponentially weighted moving average of the squared parameter updates $\Delta\theta$. The idea is to use a first-order method to approximate a second-order Newton step. Its update rule is:
$$s_g := \beta\cdot s_g + (1-\beta)\cdot\nabla_\theta J(\theta)\odot\nabla_\theta J(\theta)$$
$$\Delta\theta := \frac{RMS[\Delta\theta]}{RMS[\nabla_\theta J(\theta)]}\cdot\nabla_\theta J(\theta) = \sqrt{\frac{s_{\Delta\theta}+\epsilon}{s_g+\epsilon}}\cdot\nabla_\theta J(\theta)$$
$$s_{\Delta\theta} := \beta\cdot s_{\Delta\theta} + (1-\beta)\cdot\Delta\theta\odot\Delta\theta$$
$$\theta := \theta - \Delta\theta$$
Here $s_g$ is the exponentially weighted moving average of the squared gradient and $s_{\Delta\theta}$ is the one of the squared parameter updates $\Delta\theta$; both are initialized to 0. $\epsilon$ is a constant for numerical stability, typically set to $10^{-6}$.
In AdaDelta the numerator can be seen as a momentum-like acceleration term that accumulates previous parameter updates with exponential weighting, while the denominator is the same as in RMSProp; in that sense RMSProp can be viewed as a special case of AdaDelta.
Pros: no learning rate needs to be set by hand.
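Here is a minimal sketch of a single AdaDelta step in isolation (the gradient value is a placeholder chosen only for illustration; variable names mirror the script below), showing how the ratio of the two running averages takes the place of a global learning rate:
% One AdaDelta step in isolation (placeholder gradient, illustrative only)
beta = 0.5; eps_ = 1e-4;
s_g = 0; s_t = 0;                                % running averages: squared gradient / squared update
theta = [-30; -4];
grad  = [1.5; -0.3];                             % placeholder gradient
s_g    = beta*s_g + (1-beta)*grad.*grad;         % accumulate squared gradient
dtheta = sqrt((s_t+eps_)./(s_g+eps_)).*grad;     % step scaled by RMS[dtheta]/RMS[grad]
s_t    = beta*s_t + (1-beta)*dtheta.*dtheta;     % accumulate squared update
theta  = theta - dtheta;                         % update: no global learning rate anywhere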
Here we run AdaDelta with decay factor $\beta = 0.5$ to observe the iterations. Because my data has low variance in the $\theta_1$ direction, I set $\epsilon = 10^{-4}$ so that the effective learning rate does not become too small. Below is the AdaDelta iteration process:
The AdaDelta MATLAB code is attached below; try adjusting the hyperparameters $\beta$ and $\epsilon$ to see how the convergence changes:
% Written by weichen GU, date 4/23/2020
clear, clf, clc;
data = linspace(-20,20,100); % x range
col = length(data); % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10]; % Generate dataset: y = 0.5*x + white Gaussian noise + 10
X = [ones(1, col); data(1,:)]'; % X ->[1;X];
t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);
theta =[-30;-4]; % Initialize parameters
%LRate = 0.5; % Learning rate
thresh = 0.5; % Threshold of loss for jumping iteration
iteration = 300; % The number of iterations
lineX = linspace(-30,30,100);
[row, col] = size(data); % Obtain the size of dataset
lineMy = [lineX;theta(1)+theta(2)*lineX]; % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
loss = getLoss(X,data(2,:)',col,theta); % Obtain current loss value
subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;
% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');
% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')
% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using AdaDelta');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');
set(gca,'XLim',[0 iteration]);
hold on;
batchSize = 32;
s_g = 0; s_t = 0;
beta = 0.5;
for iter = 1 : iteration
delete(hLine) % set(hLine,'visible','off')
[thetaOut,s_g,s_t] = MBGD(X,data(2,:)',theta,batchSize,s_g,s_t,beta);
subplot(2,2,3);
line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')
theta = thetaOut;
loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
subplot(2,2,1);
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
%legend('training data','linear regression');
subplot(2,2,2);
scatter3(theta(1),theta(2),loss,'r*');
subplot(2,2,3);
plot(theta(1),theta(2),'r*')
subplot(2,2,4)
plot(iter,loss,'b*');
drawnow();
if(loss < thresh)
break;
end
end
hold off
function [Z] = getTotalCost(X,Y, num,meshX,meshY);
[row,col] = size(meshX);
Z = zeros(row, col);
for i = 1 : row
theta = [meshX(i,:); meshY(i,:)];
Z(i,:) = 1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);
end
end
function [Z] = getLoss(X,Y, num,theta)
Z= 1/(2*num)*sum((X*theta-Y).^2);
end
function [thetaOut] = GD(X,Y,theta,eta)
dataSize = length(X); % Obtain the number of data
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
thetaOut = theta -eta.*dx; % Update parameters(theta)
end
function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = s + dx.*dx;
thetaOut = theta -eta./sqrt(eps+s).*dx; % Update parameters(theta)
end
function [thetaOut,s] = RMSProp(X,Y,theta,eta,s,beta)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = beta*s + (1-beta)*dx.*dx;
thetaOut = theta -eta./sqrt(eps+s).*dx; % Update parameters(theta)
end
function [thetaOut,s_g, s_t] = AdaDelta(X,Y,theta,s_g,s_t,beta)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-4*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s_g = beta*s_g + (1-beta)*dx.*dx;
dt = sqrt((s_t+eps)./(s_g+eps)).*dx;
s_t = beta*s_t + (1-beta)*dt.*dt;
thetaOut = theta - dt; % Update parameters(theta)
end
% @ Description:
% Mini-batch Gradient Descent (MBGD)
% Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
% X - [1 X_] X_ is actual X; Y - actual Y
% theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
% beta - decay factor (AdaDelta uses no global learning rate);
%
function [thetaOut,s_g, s_t] = MBGD(X,Y,theta,batchSize,s_g,s_t,beta)
dataSize = length(X); % obtain the number of data
k = fix(dataSize/batchSize); % number of full-size batches
batchIdx = randperm(dataSize); % shuffle indices every epoch for sample diversity
batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize); % indices of the full-size batches
batchIdx2 = batchIdx(k*batchSize+1:end); % indices of the remaining (smaller) batch
for i = 1 : k
[thetaOut,s_g, s_t] = AdaDelta(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,s_g,s_t,beta);
end
if(~isempty(batchIdx2))
[thetaOut,s_g, s_t] = AdaDelta(X(batchIdx2,:),Y(batchIdx2),thetaOut,s_g,s_t,beta);
end
end
Adam combines the ideas of RMSProp and Momentum, achieving both an adaptive learning rate and momentum-accelerated convergence. For momentum gradient descent, see the previous article. Its update rule is:
$$m := \gamma\cdot m + (1-\gamma)\cdot\nabla_\theta J(\theta)$$
$$s := \beta\cdot s + (1-\beta)\cdot\nabla_\theta J(\theta)\odot\nabla_\theta J(\theta)$$
$$\hat{m} := \frac{m}{1-\gamma^t} \qquad \hat{s} := \frac{s}{1-\beta^t}$$
$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\hat{m}$$
$\hat{s}$ and $\hat{m}$ are the bias-corrected values of $s$ and $m$; the correction makes the weights on past gradients sum to 1 and keeps the estimates from being too small early on. Common default hyperparameters are $\beta = 0.999$, $\gamma = 0.9$, $\epsilon = 10^{-8}$.
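To see why the bias correction matters, take the first iteration ($t = 1$) with $s$ and $m$ initialized to 0 and first gradient $g_1 = \nabla_\theta J(\theta)$:
$$s_1 = (1-\beta)\,g_1\odot g_1, \qquad \hat{s}_1 = \frac{s_1}{1-\beta^1} = g_1\odot g_1, \qquad \hat{m}_1 = \frac{(1-\gamma)\,g_1}{1-\gamma^1} = g_1$$
so the corrected estimates equal the current (squared) gradient itself, whereas the raw $s_1$ and $m_1$ are scaled down by $(1-\beta)$ and $(1-\gamma)$ and are therefore biased toward 0.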
The figure below shows the convergence of Adam. We use the same learning rate $\eta = 0.5$ and decay factor $\beta = 0.99$ as for RMSProp above, together with momentum decay factor $\gamma = 0.9$. Compared with RMSProp, Adam's momentum accelerates convergence in the $\theta_0$ direction and suppresses the oscillation in the $\theta_1$ direction.
The Adam implementation is attached below:
% Written by weichen GU, date 4/19/2020
clear, clf, clc;
data = linspace(-20,20,100); % x range
col = length(data); % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10]; % Generate dataset: y = 0.5*x + white Gaussian noise + 10
X = [ones(1, col); data(1,:)]'; % X ->[1;X];
t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);
theta =[-30;-4]; % Initialize parameters
LRate = 0.5; % Learning rate
thresh = 0.5; % Threshold of loss for jumping iteration
iteration = 100; % The number of iterations
lineX = linspace(-30,30,100);
[row, col] = size(data); % Obtain the size of dataset
lineMy = [lineX;theta(1)+theta(2)*lineX]; % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
loss = getLoss(X,data(2,:)',col,theta); % Obtain current loss value
subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;
% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');
% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')
% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using Adam');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');
set(gca,'XLim',[0 iteration]);
%set(gca,'YLim',[0 4000]);
hold on;
batchSize = 32;
s = 0;
beta = 0.99;
momentum = 0;
gamma = 0.9;
cnt = 0;
for iter = 1 : iteration
cnt = cnt+1;
delete(hLine) % set(hLine,'visible','off')
[thetaOut,s,momentum] = MBGD(X,data(2,:)',theta,LRate,batchSize,s,beta,momentum,gamma,cnt);
subplot(2,2,3);
line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')
theta = thetaOut;
loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
subplot(2,2,1);
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
%legend('training data','linear regression');
subplot(2,2,2);
scatter3(theta(1),theta(2),loss,'r*');
subplot(2,2,3);
plot(theta(1),theta(2),'r*')
subplot(2,2,4)
plot(iter,loss,'b*');
drawnow();
if(loss < thresh)
break;
end
end
hold off
function [Z] = getTotalCost(X,Y, num,meshX,meshY);
[row,col] = size(meshX);
Z = zeros(row, col);
for i = 1 : row
theta = [meshX(i,:); meshY(i,:)];
Z(i,:) = 1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);
end
end
function [Z] = getLoss(X,Y, num,theta)
Z= 1/(2*num)*sum((X*theta-Y).^2);
end
function [thetaOut] = GD(X,Y,theta,eta)
dataSize = length(X); % Obtain the number of data
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
thetaOut = theta -eta.*dx; % Update parameters(theta)
end
function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = s + dx.*dx;
thetaOut = theta -eta./sqrt(eps+s).*dx; % Update parameters(theta)
end
function [thetaOut,s] = RMSProp(X,Y,theta,eta,s,decay)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = decay*s + (1-decay)*dx.*dx;
thetaOut = theta -eta./sqrt(eps+s).*dx; % Update parameters(theta)
end
function [thetaOut,s, momentum] = Adam(X,Y,theta,eta,s,beta,momentum,gamma,cnt)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = beta*s + (1-beta)*dx.*dx; % Update s
momentum = gamma*momentum + (1-gamma).*dx; % Update momentum
momentum_bar = momentum/(1-gamma^cnt);
s_bar = s /(1-beta^cnt);
thetaOut = theta - eta./sqrt(eps+s_bar).*momentum_bar; % Update parameters(theta)
end
% @ Description:
% Mini-batch Gradient Descent (MBGD)
% Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
% X - [1 X_] X_ is actual X; Y - actual Y
% theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
% eta - learning rate;
%
function [thetaOut,s,momentum] = MBGD(X,Y,theta, eta,batchSize,s,beta,momentum,gamma,cnt)
dataSize = length(X); % obtain the number of data
k = fix(dataSize/batchSize); % number of full-size batches
batchIdx = randperm(dataSize); % shuffle indices every epoch for sample diversity
batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize); % indices of the full-size batches
batchIdx2 = batchIdx(k*batchSize+1:end); % indices of the remaining (smaller) batch
for i = 1 : k
%[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
[thetaOut,s,momentum] = Adam(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s,beta,momentum,gamma,cnt);
end
if(~isempty(batchIdx2))
%[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
[thetaOut,s,momentum] = Adam(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s,beta,momentum,gamma,cnt);
end
end
Nadam is essentially Adam with a Nesterov momentum term: a look-ahead gradient is used in place of the current gradient (see NAG in the previous article). Its update rule is:
$$m := \gamma\cdot m + (1-\gamma)\cdot\nabla_\theta J(\theta)$$
$$s := \beta\cdot s + (1-\beta)\cdot\nabla_\theta J(\theta)\odot\nabla_\theta J(\theta)$$
$$\hat{g} = \frac{\nabla_\theta J(\theta)}{1-\gamma^t} \qquad \hat{m} := \frac{m}{1-\gamma^t} \qquad \hat{s} := \frac{s}{1-\beta^t}$$
$$\bar{m} = (1-\gamma)\cdot\hat{g} + \gamma\cdot\hat{m}$$
$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\bar{m}$$
Here is a derivation from the Zhihu user 溪亭日暮.
From Adam we have
$$m := \gamma\cdot\dot{m} + (1-\gamma)\cdot\nabla_\theta J(\theta)$$
$$\hat{m} = \frac{m}{1-\gamma^t}$$
$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\hat{m}$$
where $\dot{m}$ denotes the momentum accumulated so far. Hence
$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\left(\frac{\gamma\cdot\dot{m}}{1-\gamma^t} + \frac{(1-\gamma)\cdot\nabla_\theta J(\theta)}{1-\gamma^t}\right)$$
Because $\frac{\gamma}{1-\gamma^t}$ acts as a bias-correction term, this can be rewritten as
$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\left(\gamma\cdot\hat{\dot{m}} + \frac{(1-\gamma)\cdot\nabla_\theta J(\theta)}{1-\gamma^t}\right)$$
Compare this with the expanded form of classical momentum,
$$\theta := \theta - (\gamma\cdot\dot{m} + \eta\cdot g_t)$$
where $\dot{m}$ again denotes the accumulated momentum. Replacing the accumulated momentum with the Nesterov momentum, we obtain
$$\theta := \theta - \frac{\eta}{\sqrt{\hat{s}+\epsilon}}\odot\left(\gamma\cdot\hat{m} + \frac{(1-\gamma)\cdot\nabla_\theta J(\theta)}{1-\gamma^t}\right)$$
which is the Nadam update. To stay consistent with the earlier formulas, I did not introduce $m_{t-1}$ in place of $\dot{m}$; if anything is unclear, please refer to the linked post.
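In code, the only difference from Adam is the direction used in the final update: Adam steps along the bias-corrected momentum $\hat{m}$, while Nadam blends in the bias-corrected current gradient. A minimal side-by-side sketch follows (placeholder gradient and first-iteration values, for illustration only; the notation matches the functions in the script below):
% Adam vs Nadam update direction at t = 1 (placeholder values, illustrative only)
eta = 0.5; beta = 0.99; gamma = 0.9; eps_ = 1e-7; t = 1;
theta = [-30; -4];
dx    = [1.5; -0.3];                          % placeholder gradient
s     = (1-beta)*dx.*dx;                      % first squared-gradient average (s starts at 0)
m     = (1-gamma)*dx;                         % first momentum (m starts at 0)
g_hat = dx/(1-gamma^t);                       % bias-corrected current gradient
m_hat = m /(1-gamma^t);                       % bias-corrected momentum
s_hat = s /(1-beta^t);                        % bias-corrected squared-gradient average
theta_adam  = theta - eta./sqrt(s_hat+eps_).*m_hat;                           % Adam step
theta_nadam = theta - eta./sqrt(s_hat+eps_).*((1-gamma)*g_hat + gamma*m_hat); % Nadam step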
The iterations with Nadam are shown below:
Compared with Adam, Nadam suppresses part of the oscillation.
The corresponding Nadam MATLAB code follows:
% Written by weichen GU, date 4/19/2020
clear, clf, clc;
data = linspace(-20,20,100); % x range
col = length(data); % Obtain the number of x
data = [data;0.5*data + wgn(1,100,1)+10]; % Generate dataset: y = 0.5*x + white Gaussian noise + 10
X = [ones(1, col); data(1,:)]'; % X ->[1;X];
t1=-40:0.1:50;
t2=-4:0.1:4;
[meshX,meshY]=meshgrid(t1,t2);
meshZ = getTotalCost(X, data(2,:)', col, meshX,meshY);
theta =[-30;-4]; % Initialize parameters
LRate = 0.5; % Learning rate
thresh = 0.5; % Threshold of loss for jumping iteration
iteration = 100; % The number of iterations
lineX = linspace(-30,30,100);
[row, col] = size(data); % Obtain the size of dataset
lineMy = [lineX;theta(1)+theta(2)*lineX]; % Fitting line: y = theta_0 + theta_1*x
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
loss = getLoss(X,data(2,:)',col,theta); % Obtain current loss value
subplot(2,2,1);
plot(data(1,:),data(2,:),'r.','MarkerSize',10);
title('Data fitting using Univariate LR');
axis([-30,30,-10,30])
xlabel('x');
ylabel('y');
hold on;
% Draw 3d loss surfaces
subplot(2,2,2)
mesh(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('3D surfaces for loss')
hold on;
scatter3(theta(1),theta(2),loss,'r*');
% Draw loss contour figure
subplot(2,2,3)
contour(meshX,meshY,meshZ)
xlabel('θ_0');
ylabel('θ_1');
title('Contour figure for loss')
hold on;
plot(theta(1),theta(2),'r*')
% Draw loss with iteration
subplot(2,2,4)
hold on;
title('Loss when using Nadam');
xlabel('iter');
ylabel('loss');
plot(0,loss,'b*');
set(gca,'XLim',[0 iteration]);
%set(gca,'YLim',[0 4000]);
hold on;
batchSize = 32;
s = 0;
beta = 0.99;
momentum = 0;
gamma = 0.9;
cnt = 0;
for iter = 1 : iteration
cnt = cnt+1;
delete(hLine) % set(hLine,'visible','off')
[thetaOut,s,momentum] = MBGD(X,data(2,:)',theta,LRate,batchSize,s,beta,momentum,gamma,cnt);
subplot(2,2,3);
line([theta(1),thetaOut(1)],[theta(2),thetaOut(2)],'color','k')
theta = thetaOut;
loss = getLoss(X,data(2,:)',col,theta); % Obtain loss
lineMy(2,:) = theta(2)*lineX+theta(1); % Fitting line
subplot(2,2,1);
hLine = plot(lineMy(1,:),lineMy(2,:),'c','linewidth',2); % draw fitting line
%legend('training data','linear regression');
subplot(2,2,2);
scatter3(theta(1),theta(2),loss,'r*');
subplot(2,2,3);
plot(theta(1),theta(2),'r*')
subplot(2,2,4)
plot(iter,loss,'b*');
drawnow();
if(loss < thresh)
break;
end
end
hold off
function [Z] = getTotalCost(X,Y, num,meshX,meshY);
[row,col] = size(meshX);
Z = zeros(row, col);
for i = 1 : row
theta = [meshX(i,:); meshY(i,:)];
Z(i,:) = 1/(2*num)*sum((X*theta-repmat(Y,1,col)).^2);
end
end
function [Z] = getLoss(X,Y, num,theta)
Z= 1/(2*num)*sum((X*theta-Y).^2);
end
function [thetaOut] = GD(X,Y,theta,eta)
dataSize = length(X); % Obtain the number of data
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
thetaOut = theta -eta.*dx; % Update parameters(theta)
end
function [thetaOut,s] = AdaGrad(X,Y,theta,eta,s)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = s + dx.*dx;
thetaOut = theta -eta./sqrt(eps+s).*dx; % Update parameters(theta)
end
function [thetaOut,s] = RMSProp(X,Y,theta,eta,s,decay)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = decay*s + (1-decay)*dx.*dx;
thetaOut = theta -eta./sqrt(eps+s).*dx; % Update parameters(theta)
end
function [thetaOut,s, momentum] = Adam(X,Y,theta,eta,s,beta,momentum,gamma,cnt)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
s = beta*s + (1-beta)*dx.*dx; % Update s
momentum = gamma*momentum + (1-gamma).*dx; % Update momentum
momentum_bias_correction = momentum/(1-gamma^cnt);
s_bias_correction = s /(1-beta^cnt);
thetaOut = theta - eta./sqrt(eps+s_bias_correction).*momentum_bias_correction; % Update parameters(theta)
end
function [thetaOut,s, momentum] = Nadam(X,Y,theta,eta,s,beta,momentum,gamma,cnt)
[dataSize,col] = size(X); % Obtain the number of data
eps = 10^-7*ones(col,1);
dx = 1/dataSize.*(X'*(X*theta-Y)); % Obtain the gradient of Loss function
g_hat = dx/(1-gamma^cnt);
s = beta*s + (1-beta)*dx.*dx; % Update s
momentum = gamma*momentum + (1-gamma).*dx; % Update momentum
momentum_hat = momentum/(1-gamma^cnt);
s_hat = s /(1-beta^cnt);
m_bar = (1-gamma)*g_hat+gamma*momentum_hat;
thetaOut = theta - eta./sqrt(eps+s_hat).*m_bar; % Update parameters(theta)
end
% @ Description:
% Mini-batch Gradient Descent (MBGD)
% Stochastic Gradient Descent(batchSize = 1) (SGD)
% @ param:
% X - [1 X_] X_ is actual X; Y - actual Y
% theta - theta for univariate linear regression y_pred = theta_0 + theta1*x
% eta - learning rate;
%
function [thetaOut,s,momentum] = MBGD(X,Y,theta, eta,batchSize,s,beta,momentum,gamma,cnt)
dataSize = length(X); % obtain the number of data
k = fix(dataSize/batchSize); % number of full-size batches
batchIdx = randperm(dataSize); % shuffle indices every epoch for sample diversity
batchIdx1 = reshape(batchIdx(1:k*batchSize),k,batchSize); % indices of the full-size batches
batchIdx2 = batchIdx(k*batchSize+1:end); % indices of the remaining (smaller) batch
for i = 1 : k
%[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
%[thetaOut,s,momentum] = Adam(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s,beta,momentum,gamma,cnt);
[thetaOut,s,momentum] = Nadam(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s,beta,momentum,gamma,cnt);
end
if(~isempty(batchIdx2))
%[thetaOut,s] = AdaGrad(X(batchIdx1(i,:),:),Y(batchIdx1(i,:)),theta,eta,s);
%[thetaOut,s,momentum] = Adam(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s,beta,momentum,gamma,cnt);
[thetaOut,s,momentum] = Nadam(X(batchIdx2,:),Y(batchIdx2),thetaOut,eta,s,beta,momentum,gamma,cnt);
end
end