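In prediction, x is the variable that y depends on. In training, however, w is the variable of the loss L, and x cannot change. So now you know why the weights are called Variable (a rather forced explanation, admittedly).
Below, gradient descent is implemented by hand in TensorFlow:
To make the formulas easier to write, the code below renames the variables to the initial letters of loss, prediction, gradient and weight, plus y and x (so l, p, g, w, y, x). η is the learning rate, and w0, w1, w2, ... are the values of w at successive iterations, not separate variables.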
loss = (y-p)^2 = (y-w*x)^2 = y^2 - 2*y*w*x + w^2*x^2
dl/dw = 2*w*x^2 - 2*y*x
Substituting into the gradient descent formula:
w1 = w0 - η*dl/dw|w=w0
w2 = w1 - η*dl/dw|w=w1
w3 = w2 - η*dl/dw|w=w2
Initial: y=3, x=1, w=2, l=1, dl/dw=-2, η=1
Update: w = 2 - 1*(-2) = 4
Update: w = 4 - 1*2 = 2
Update: w = 2 - 1*(-2) = 4
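So in this example, with x=1 and y=3, dl/dw happens to equal 2w-2y, i.e. twice the gap between the prediction and the label. With learning rate = 1, w bounces back and forth around the correct value (3) and never converges. That value was chosen mainly to keep the arithmetic easy to follow. Lowering the learning rate and increasing the number of iterations makes it converge.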
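The oscillation can be read off the update rule directly. The short derivation below is not in the original; it only rearranges the formulas above by subtracting the optimum y/x from both sides of the update:

```latex
\begin{aligned}
\frac{dl}{dw} &= 2wx^2 - 2yx = 2x^2\left(w - \frac{y}{x}\right) \\
w_{n+1} - \frac{y}{x}
  &= \left(w_n - \frac{y}{x}\right) - 2\eta x^2\left(w_n - \frac{y}{x}\right)
   = \left(1 - 2\eta x^2\right)\left(w_n - \frac{y}{x}\right)
\end{aligned}
```

With x=1 and η=1 the factor is 1-2 = -1, so the distance to the optimum w=3 only flips sign at each step (2, 4, 2, 4, ...). The iteration shrinks that distance only when |1 - 2ηx^2| < 1, which for x=1 means 0 < η < 1.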
With a large learning rate, the effect is roughly this:
#demo4: manual gradient descent in tensorflow
import tensorflow as tf  # TensorFlow 1.x graph-mode API

# y is the label
y = tf.constant(3, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
w = tf.Variable(2, dtype=tf.float32)
# prediction
p = w * x
# loss: squared error
l = tf.square(p - y)
# tf.gradients returns a list, hence g[0] below
g = tf.gradients(l, w)
learning_rate = tf.constant(1, dtype=tf.float32)
#learning_rate = tf.constant(0.11, dtype=tf.float32)
init = tf.global_variables_initializer()
# update: w <- w - learning_rate * dl/dw
update = tf.assign(w, w - learning_rate * g[0])
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g, p, w], {x: 1}))
    for _ in range(5):
        w_, g_, l_ = sess.run([w, g, l], feed_dict={x: 1})
        print('variable is w:', w_, ' g is ', g_, ' and the loss is ', l_)
        _ = sess.run(update, feed_dict={x: 1})
Results:
learning rate=1
[[-2.0], 2.0, 2.0]
variable is w: 2.0 g is [-2.0] and the loss is 1.0
variable is w: 4.0 g is [2.0] and the loss is 1.0
variable is w: 2.0 g is [-2.0] and the loss is 1.0
variable is w: 4.0 g is [2.0] and the loss is 1.0
variable is w: 2.0 g is [-2.0] and the loss is 1.0
With a smaller learning rate:
variable is w: 2.9964619 g is [-0.007575512] and the loss is 1.4347095e-05
variable is w: 2.996695 g is [-0.0070762634] and the loss is 1.2518376e-05
variable is w: 2.996913 g is [-0.0066099167] and the loss is 1.0922749e-05
variable is w: 2.9971166 g is [-0.0061740875] and the loss is 9.529839e-06
variable is w: 2.9973066 g is [-0.0057668686] and the loss is 8.314193e-06
variable is w: 2.9974842 g is [-0.0053868294] and the loss is 7.2544826e-06
variable is w: 2.9976501 g is [-0.0050315857] and the loss is 6.3292136e-06
variable is w: 2.997805 g is [-0.004699707] and the loss is 5.5218115e-06
variable is w: 2.9979498 g is [-0.004389763] and the loss is 4.8175043e-06
variable is w: 2.998085 g is [-0.0041003227] and the loss is 4.2031616e-06
variable is w: 2.9982114 g is [-0.003829956] and the loss is 3.6671408e-06
variable is w: 2.9983294 g is [-0.0035772324] and the loss is 3.1991478e-06
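Both behaviours can be cross-checked without TensorFlow. The following is a minimal pure-Python sketch of the same update rule; the function names are made up for illustration, and 0.11 is simply the commented-out learning rate from demo4, not necessarily the value behind the output above.

```python
def gradient(w, x, y):
    """dl/dw for l = (w*x - y)**2."""
    return 2 * w * x * x - 2 * y * x


def gradient_descent(w, x, y, lr, steps):
    """Return the list of w values visited, starting with the initial w."""
    history = [w]
    for _ in range(steps):
        w = w - lr * gradient(w, x, y)
        history.append(w)
    return history


# learning rate 1: w bounces between 2 and 4 forever
oscillating = gradient_descent(w=2.0, x=1.0, y=3.0, lr=1.0, steps=4)
print(oscillating)  # [2.0, 4.0, 2.0, 4.0, 2.0]

# learning rate 0.11: w approaches the correct value 3
converging = gradient_descent(w=2.0, x=1.0, y=3.0, lr=0.11, steps=50)
print(converging[-1])
```

The first list reproduces the 2, 4, 2, 4 pattern printed by demo4. The second run ends very close to 3.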
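Note that TensorFlow has no gradient descent interface called SGD (Stochastic Gradient Descent). SGD is more a strategy for feeding data than a specific training method. According to Andrew Ng's course, SGD in the strict sense trains on only one sample at a time. In practice, mini-batches of several samples are far more common, and as long as the data is shuffled when it is fed in, it counts as SGD (mini-batch).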
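To make the "SGD is a data-feeding strategy" point concrete, here is a minimal pure-Python sketch. The helper name `minibatches`, the toy data set and the hyperparameters are all assumptions chosen for illustration. The gradient step is the same in both runs and only the way samples are fed changes: `batch_size=1` is SGD in the strict sense, larger values give mini-batch.

```python
import random


def minibatches(samples, batch_size, rng):
    """Yield the samples in random order, batch_size at a time (one epoch)."""
    order = list(range(len(samples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


def train(samples, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit p = w*x with a plain gradient step; only the feeding changes."""
    rng = random.Random(seed)
    w = 2.0
    for _ in range(epochs):
        for batch in minibatches(samples, batch_size, rng):
            # average of dl/dw = 2*x*(w*x - y) over the batch
            grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
            w -= lr * grad
    return w


# toy data generated from y = 3*x, so the correct weight is 3
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_sgd = train(data, batch_size=1)    # strict SGD: one sample per update
w_mini = train(data, batch_size=2)   # mini-batch of two samples
print(w_sgd, w_mini)
```

Both runs end up near w = 3 on this noise-free toy data.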
Link: Implementation and intuitive comparison of Gradient Descent, Momentum, and Nesterov
#demo5.2 tensorflow momentum
import tensorflow as tf

y = tf.constant(3, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
w = tf.Variable(2, dtype=tf.float32)
# prediction
p = w * x
# loss: squared error
l = tf.square(p - y)
g = tf.gradients(l, w)
Mu = 0.8
LR = tf.constant(0.01, dtype=tf.float32)
# update w with the built-in momentum optimizer
update = tf.train.MomentumOptimizer(LR, Mu).minimize(l)
# build the initializer after minimize() so that it also covers
# the momentum slot variable created by the optimizer
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g, p, w], {x: 1}))
    for _ in range(10):
        w_, g_, l_ = sess.run([w, g, l], feed_dict={x: 1})
        print('variable is w:', w_, ' g is ', g_, ' and the loss is ', l_)
        sess.run([update], feed_dict={x: 1})
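These are the values from the first few iterations. Look at them closely and compare them with the manual implementation below.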
variable is w: 2.0 g is [-2.0] and the loss is 1.0
variable is w: 2.02 g is [-1.96] and the loss is 0.96040004
variable is w: 2.0556 g is [-1.8888001] and the loss is 0.8918915
variable is w: 2.102968 g is [-1.794064] and the loss is 0.80466646
variable is w: 2.158803 g is [-1.682394] and the loss is 0.7076124
variable is w: 2.220295 g is [-1.5594101] and the loss is 0.60793996
variable is w: 2.2850826 g is [-1.4298348] and the loss is 0.5111069
variable is w: 2.351211 g is [-1.2975779] and the loss is 0.42092708
variable is w: 2.4170897 g is [-1.1658206] and the loss is 0.3397844
variable is w: 2.4814508 g is [-1.0370984] and the loss is 0.26889327
#demo5.2: manual momentum in tensorflow
import tensorflow as tf

y = tf.constant(3, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
w = tf.Variable(2, dtype=tf.float32)
# prediction
p = w * x
# loss: squared error
l = tf.square(p - y)
g = tf.gradients(l, w)
Mu = 0.8
LR = tf.constant(0.01, dtype=tf.float32)
# tf.Variable(0, tf.float32) is wrong: the second positional argument
# of tf.Variable is trainable, not dtype
v = tf.Variable(0, dtype=tf.float32)
init = tf.global_variables_initializer()
# update the velocity v first, then w
update1 = tf.assign(v, Mu * v + g[0] * LR)
update2 = tf.assign(w, w - v)
# update = tf.group(update1, update2)  # wrong: tf.group does not guarantee the order
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g, p, w], {x: 1}))
    for _ in range(10):
        w_, g_, l_, v_ = sess.run([w, g, l, v], feed_dict={x: 1})
        print('variable is w:', w_, ' g is ', g_, ' v is ', v_, ' and the loss is ', l_)
        # two separate run calls keep update1 strictly before update2
        _ = sess.run([update1], feed_dict={x: 1})
        _ = sess.run([update2], feed_dict={x: 1})
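Look at the first group of numbers below: the values of w, g and the loss are the same as those produced by TensorFlow's built-in implementation.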
variable is w: 2.0 g is [-2.0] v is 0.0 and the loss is 1.0
variable is w: 2.0 g is [-2.0] v is -0.02 and the loss is 1.0
variable is w: 2.02 g is [-1.96] v is -0.0356 and the loss is 0.96040004
variable is w: 2.0556 g is [-1.8888001] v is -0.047367997 and the loss is 0.8918915
variable is w: 2.102968 g is [-1.794064] v is -0.05583504 and the loss is 0.80466646
variable is w: 2.158803 g is [-1.682394] v is -0.06149197 and the loss is 0.7076124
variable is w: 2.220295 g is [-1.5594101] v is -0.06478768 and the loss is 0.60793996
variable is w: 2.2850826 g is [-1.4298348] v is -0.06612849 and the loss is 0.5111069
variable is w: 2.351211 g is [-1.2975779] v is -0.06587857 and the loss is 0.42092708
variable is w: 2.4170897 g is [-1.1658206] v is -0.06436106 and the loss is 0.3397844
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
variable is w: 2.9999995 g is [-9.536743e-07] v is -4.7683734e-08 and the loss is 2.2737368e-13
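The same momentum recurrence can be checked without TensorFlow. This is a minimal pure-Python sketch (the function name is hypothetical) using the same values as demo5.2: Mu=0.8, LR=0.01, x=1, y=3 and initial w=2.

```python
def momentum_descent(w, x, y, lr, mu, steps):
    """Return the w values visited using v = mu*v + lr*g, then w = w - v."""
    v = 0.0
    history = [w]
    for _ in range(steps):
        g = 2 * x * (w * x - y)   # dl/dw for l = (w*x - y)**2
        v = mu * v + lr * g       # velocity: decayed past steps plus the new step
        w = w - v
        history.append(w)
    return history


ws = momentum_descent(w=2.0, x=1.0, y=3.0, lr=0.01, mu=0.8, steps=3)
print([round(value, 6) for value in ws])  # [2.0, 2.02, 2.0556, 2.102968]
```

These are the same w values as the first lines printed by MomentumOptimizer above.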
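Adagrad has a bit of the flavor of using the Hessian matrix, but it works with an approximation instead of true second derivatives: each step is scaled by the square root of the accumulated squared first-order gradients. Actually computing second derivatives is still very expensive in deep learning.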
#demo6: adagrad optimizer in tensorflow
import tensorflow as tf

y = tf.constant(3, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
w = tf.Variable(2, dtype=tf.float32)
# prediction
p = w * x
# loss: squared error
l = tf.square(p - y)
g = tf.gradients(l, w)
LR = tf.constant(0.6, dtype=tf.float32)
optimizer = tf.train.AdagradOptimizer(LR)
update = optimizer.minimize(l)
# created after minimize() so the optimizer's accumulator is initialized too
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    #print(sess.run([g, p, w], {x: 1}))
    for _ in range(20):
        w_, l_, g_ = sess.run([w, l, g], feed_dict={x: 1})
        print('variable is w:', w_, 'g:', g_, ' and the loss is ', l_)
        _ = sess.run(update, feed_dict={x: 1})
Control dependencies (tf.control_dependencies) can be used to enforce the order of the update ops.
#demo6.2: manual adagrad
import tensorflow as tf

y = tf.constant(3, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
w = tf.Variable(2, dtype=tf.float32)
# despite its name, this holds the running sum of squared gradients
second_derivative = tf.Variable(0, dtype=tf.float32)
LR = tf.constant(0.6, dtype=tf.float32)
# small constant that keeps the denominator away from zero
Regular = 1e-8
# prediction
p = w * x
# loss: squared error
l = tf.square(p - y)
# gradients
g = tf.gradients(l, w)
# update 1: accumulate the squared gradient
update1 = tf.assign_add(second_derivative, tf.square(g[0]))
# update 2: step scaled by the root of the accumulated squared gradients
g_final = LR * g[0] / (tf.sqrt(second_derivative) + Regular)
update2 = tf.assign(w, w - g_final)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g, p, w], {x: 1}))
    for _ in range(20):
        _ = sess.run(update1, feed_dict={x: 1.0})
        w_, g_, l_, g_sec_ = sess.run([w, g, l, second_derivative], feed_dict={x: 1.0})
        print('variable is w:', w_, ' g is ', g_, ' g_sec_ is ', g_sec_, ' and the loss is ', l_)
        _ = sess.run(update2, feed_dict={x: 1.0})
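The results are close but, unfortunately, not exactly the same. I did not know what parameters the optimizer uses internally or whether it adds any regularization, which makes it rather opaque. The numbers are consistent with the default initial_accumulator_value of 0.1 in tf.train.AdagradOptimizer: with the accumulator starting at 0.1 instead of 0, the first step is 0.6*2/sqrt(4.1) ≈ 0.5926 rather than 0.6, which gives w = 2.5926 as in the second group of output below (the lines without g_sec_, from the built-in optimizer). The first group is from the manual version.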
[[-2.0], 2.0, 2.0]
variable is w: 2.0 g is [-2.0] g_sec_ is 0.0 and the loss is 1.0
variable is w: 2.6 g is [-0.8000002] g_sec_ is 4.0 and the loss is 0.16000007
variable is w: 2.8228343 g is [-0.3543315] g_sec_ is 4.6400003 and the loss is 0.0313877
variable is w: 2.920222 g is [-0.15955591] g_sec_ is 4.765551 and the loss is 0.006364522
variable is w: 2.9639592 g is [-0.072081566] g_sec_ is 4.791009 and the loss is 0.0012989381
variable is w: 2.9837074 g is [-0.032585144] g_sec_ is 4.7962046 and the loss is 0.0002654479
variable is w: 2.9926338 g is [-0.014732361] g_sec_ is 4.7972665 and the loss is 5.4260614e-05
variable is w: 2.9966695 g is [-0.0066609383] g_sec_ is 4.7974834 and the loss is 1.1092025e-05
variable is w: 2.9984941 g is [-0.0030117035] g_sec_ is 4.797528 and the loss is 2.2675895e-06
variable is w: 2.999319 g is [-0.0013618469] g_sec_ is 4.797537 and the loss is 4.6365676e-07
variable is w: 2.9996922 g is [-0.0006155968] g_sec_ is 4.7975388 and the loss is 9.4739846e-08
variable is w: 2.9998608 g is [-0.0002784729] g_sec_ is 4.797539 and the loss is 1.9386789e-08
variable is w: 2.999937 g is [-0.00012588501] g_sec_ is 4.797539 and the loss is 3.961759e-09
variable is w: 2.9999716 g is [-5.6743622e-05] g_sec_ is 4.797539 and the loss is 8.0495965e-10
variable is w: 2.9999871 g is [-2.5749207e-05] g_sec_ is 4.797539 and the loss is 1.6575541e-10
variable is w: 2.9999943 g is [-1.1444092e-05] g_sec_ is 4.797539 and the loss is 3.274181e-11
variable is w: 2.9999974 g is [-5.2452087e-06] g_sec_ is 4.797539 and the loss is 6.8780537e-12
variable is w: 2.9999988 g is [-2.3841858e-06] g_sec_ is 4.797539 and the loss is 1.4210855e-12
variable is w: 2.9999995 g is [-9.536743e-07] g_sec_ is 4.797539 and the loss is 2.2737368e-13
variable is w: 2.9999998 g is [-4.7683716e-07] g_sec_ is 4.797539 and the loss is 5.684342e-14
Built-in AdagradOptimizer (demo6):
variable is w: 2.0 g: [-2.0] and the loss is 1.0
variable is w: 2.5926378 g: [-0.81472445] and the loss is 0.16594398
variable is w: 2.816606 g: [-0.3667879] and the loss is 0.033633344
variable is w: 2.9160419 g: [-0.1679163] and the loss is 0.0070489706
variable is w: 2.9614334 g: [-0.07713318] and the loss is 0.0014873818
variable is w: 2.9822717 g: [-0.035456657] and the loss is 0.00031429363
variable is w: 2.9918494 g: [-0.016301155] and the loss is 6.6431916e-05
variable is w: 2.9962525 g: [-0.0074949265] and the loss is 1.404348e-05
variable is w: 2.998277 g: [-0.0034461021] and the loss is 2.968905e-06
variable is w: 2.9992077 g: [-0.0015845299] and the loss is 6.2768373e-07
variable is w: 2.9996357 g: [-0.0007286072] and the loss is 1.327171e-07
variable is w: 2.9998324 g: [-0.00033521652] and the loss is 2.809253e-08
variable is w: 2.999923 g: [-0.0001540184] and the loss is 5.930417e-09
variable is w: 2.9999645 g: [-7.104874e-05] and the loss is 1.2619807e-09
variable is w: 2.9999835 g: [-3.2901764e-05] and the loss is 2.7063152e-10
variable is w: 2.9999924 g: [-1.5258789e-05] and the loss is 5.820766e-11
variable is w: 2.9999964 g: [-7.1525574e-06] and the loss is 1.2789769e-11
variable is w: 2.9999983 g: [-3.33786e-06] and the loss is 2.7853275e-12
variable is w: 2.9999993 g: [-1.4305115e-06] and the loss is 5.1159077e-13
variable is w: 2.9999998 g: [-4.7683716e-07] and the loss is 5.684342e-14
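The small gap between the two runs can be reproduced in plain Python. The sketch below is an approximation under stated assumptions: the function name is made up, and the built-in run is modeled as w -= lr*g/sqrt(accumulator) with no epsilon term and an accumulator starting at 0.1 (the default initial_accumulator_value of tf.train.AdagradOptimizer). That model has only been checked against the numbers printed above.

```python
import math


def adagrad_descent(w, x, y, lr, steps, initial_accumulator=0.0, eps=0.0):
    """Return the w values visited by Adagrad on l = (w*x - y)**2."""
    accumulator = initial_accumulator
    history = [w]
    for _ in range(steps):
        g = 2 * x * (w * x - y)
        accumulator += g * g      # running sum of squared gradients
        w = w - lr * g / (math.sqrt(accumulator) + eps)
        history.append(w)
    return history


# manual version: accumulator starts at 0, epsilon 1e-8 in the denominator
manual = adagrad_descent(2.0, 1.0, 3.0, lr=0.6, steps=2, eps=1e-8)
# built-in-like version: accumulator starts at 0.1, no epsilon (assumption)
builtin_like = adagrad_descent(2.0, 1.0, 3.0, lr=0.6, steps=2,
                               initial_accumulator=0.1)
print([round(value, 4) for value in manual])        # [2.0, 2.6, 2.8228]
print([round(value, 4) for value in builtin_like])  # [2.0, 2.5926, 2.8166]
```

The first list matches the manual output (2.6, 2.8228343) and the second matches the built-in output (2.5926378, 2.816606), so the starting value of the accumulator is enough to explain the difference in this example.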
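This example is for demonstration only. Adagrad's real strength shows in the multi-parameter case, and with a single parameter it has little advantage. One of its main merits is that it balances the learning rates of different parameters: each parameter is scaled by its own "second derivative" (its accumulated squared gradient), so in the end they are all treated fairly.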
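Here is a minimal two-parameter sketch of that balancing effect in pure Python. The toy loss l = 50*w0^2 + 0.5*w1^2, the starting point and the function names are all made up for illustration. The two gradients differ by a factor of 100, which is exactly the situation where a single shared learning rate is awkward.

```python
import math


def gradients(w):
    """Gradient of the toy loss l = 50*w[0]**2 + 0.5*w[1]**2."""
    return [100.0 * w[0], 1.0 * w[1]]


def sgd_step(w, lr):
    """Plain gradient step: one learning rate shared by all parameters."""
    return [wi - lr * gi for wi, gi in zip(w, gradients(w))]


def adagrad_step(w, accumulators, lr, eps=1e-8):
    """One Adagrad step; each parameter keeps its own accumulator."""
    g = gradients(w)
    new_acc = [a + gi * gi for a, gi in zip(accumulators, g)]
    new_w = [wi - lr * gi / (math.sqrt(a) + eps)
             for wi, gi, a in zip(w, g, new_acc)]
    return new_w, new_acc


start = [1.0, 1.0]
plain = sgd_step(start, lr=0.01)
ada, _ = adagrad_step(start, [0.0, 0.0], lr=0.01)
plain_sizes = [round(s - p, 6) for s, p in zip(start, plain)]
ada_sizes = [round(s - a, 6) for s, a in zip(start, ada)]
print('plain step sizes  :', plain_sizes)  # [1.0, 0.01]
print('adagrad step sizes:', ada_sizes)    # [0.01, 0.01]
```

With the plain step the first parameter moves 100 times further than the second. With Adagrad both move by about the learning rate, because each gradient is divided by the root of its own accumulated square.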
Source code