TensorFlow: Gradient Descent Explained (repost)

The Concept of Gradient Descent

Gradient descent is a first-order optimization algorithm, often also called the method of steepest descent. To find a local minimum of a function with gradient descent, we iteratively step from the current point in the direction opposite to the gradient (or an approximate gradient) at that point, by a prescribed step size. Gradient descent therefore lets us find a local minimum, or the minimum, of a function, and for n-dimensional optimization problems it is one of the most commonly used methods. Below we derive and explain it in detail by walking through its "past and present".

Gradient Descent: The Past

Let's start simple, with the following one-dimensional function:

f(x) = x^3 + 2 * x - 3

[Figure 1: plot of f(x) = x^3 + 2*x - 3]

If we want to solve f(x) = 0, we can do so through the following error function:

error = (f(x) - 0)^2

When error approaches its minimum, x approaches the solution of f(x) = 0. We can also see this from the plot:
[Figure 2: plot of error(x) = (f(x) - 0)^2]
From this plot it is easy to see that to minimize the error function we just need to move x to the bottom of the valley. We already learned how to find such a minimum in high school: take the derivative of this error function (i.e., its slope):

derivative(x) = 6 * x^5 + 16 * x^3 - 18 * x^2 + 8 * x - 12

To get the minimum, we only need to set derivative(x) = 0, which gives x = 1. Combining the plot with the derivative, we can also see that:

When x < 1, derivative < 0: the slope is negative;
When x > 1, derivative > 0: the slope is positive;
As x approaches 1, derivative approaches 0: the slope goes to zero.
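As a quick sanity check (a sketch of my own, assuming SymPy is installed, not part of the original article), we can verify the derivative expression and its sign on either side of x = 1:

import sympy as sp

x = sp.symbols('x')
error = (x**3 + 2*x - 3) ** 2                 # the error function from above
derivative = sp.expand(sp.diff(error, x))
print(derivative)                             # 6*x**5 + 16*x**3 - 18*x**2 + 8*x - 12
print(derivative.subs(x, 0.5))                # negative: slope < 0 left of x = 1
print(derivative.subs(x, 1))                  # 0 exactly at x = 1
print(derivative.subs(x, 1.5))                # positive: slope > 0 right of x = 1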
Based on these observations, we can move x through the function with the following update rule:

x = x - rate * derivative

When the slope is negative, x increases; when the slope is positive, x decreases. So x always moves toward the bottom of the valley, minimizing error and thereby solving f(x) = 0. Here rate controls how far x moves against the direction of the derivative: the larger rate is, the farther x moves on each step, and vice versa.
That works for a simple function whose derivative we can write down directly. For more complicated functions, we can instead express the derivative through its definition: if f(x) is differentiable at a point x0, then

derivative(x0) = lim(delta -> 0) (f(x0 + delta) - f(x0)) / delta
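A tiny numerical illustration of that definition (my own sketch, not from the original article): as delta shrinks, the difference quotient approaches the true derivative of f, which is 3*x**2 + 2.

def f(x):
    return x**3 + 2*x - 3

x0 = 2.0
for delta in (1e-1, 1e-3, 1e-5, 1e-7):
    approx = (f(x0 + delta) - f(x0)) / delta
    print(delta, approx)          # approaches f'(2.0) = 3*2**2 + 2 = 14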
With the formulas in place, let's turn to the implementation; all of the code below is written in Python.

def f(x):
    return x**3 + 2*x - 3


def error(x):
    # squared error between f(x) and the target value 0
    return (f(x) - 0) ** 2


def gradient_descent(x):
    # approximate the derivative of error(x) with a forward difference
    delta = 0.0000001
    derivative = (error(x + delta) - error(x)) / delta
    rate = 0.01
    # step against the derivative to move toward the bottom of the valley
    return x - rate * derivative


x = 0.1
for i in range(50):
    x = gradient_descent(x)
    print("x = {:6f},f(x) = {:6f}".format(x, f(x)))

Running this program produces the following output:

x = 0.213639,f(x) = -2.562970
x = 0.323177,f(x) = -2.319892
x = 0.430510,f(x) = -2.059189
x = 0.535777,f(x) = -1.774648
x = 0.637328,f(x) = -1.466469
x = 0.731727,f(x) = -1.144763
x = 0.814293,f(x) = -0.831477
x = 0.880632,f(x) = -0.555794
x = 0.928725,f(x) = -0.341495
x = 0.960058,f(x) = -0.194987
x = 0.978641,f(x) = -0.105437
x = 0.988917,f(x) = -0.055047
x = 0.994349,f(x) = -0.028159
x = 0.997146,f(x) = -0.014246
x = 0.998566,f(x) = -0.007166
x = 0.999281,f(x) = -0.003594
x = 0.999640,f(x) = -0.001800
x = 0.999820,f(x) = -0.000901
x = 0.999910,f(x) = -0.000451
x = 0.999955,f(x) = -0.000225
x = 0.999977,f(x) = -0.000113
x = 0.999989,f(x) = -0.000057
x = 0.999994,f(x) = -0.000028
x = 0.999997,f(x) = -0.000014
x = 0.999999,f(x) = -0.000007
x = 0.999999,f(x) = -0.000004
x = 1.000000,f(x) = -0.000002
x = 1.000000,f(x) = -0.000001
x = 1.000000,f(x) = -0.000001
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000
x = 1.000000,f(x) = -0.000000

These results confirm our earlier conclusion: at x = 1, f(x) = 0.
So with enough steps, this method gives a very precise value.
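As a variation (a sketch of my own, not part of the original code), we can replace the finite-difference approximation with the analytic derivative of error(x), obtained via the chain rule, and get the same convergence:

def f(x):
    return x**3 + 2*x - 3

def error_derivative(x):
    # chain rule: d/dx (f(x))**2 = 2 * f(x) * f'(x), with f'(x) = 3*x**2 + 2
    return 2 * f(x) * (3*x**2 + 2)

x = 0.1
rate = 0.01
for i in range(50):
    x = x - rate * error_derivative(x)
print("x = {:6f},f(x) = {:6f}".format(x, f(x)))   # converges to x = 1, f(x) = 0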

Gradient Descent: The Present

So far we have solved a one-dimensional function. What about multi-dimensional functions? Let's look at the function below; you will find the multi-dimensional case just as simple.

f(x) = x[0] + 2 * x[1] + 4

Just as before, to find values of x[0] and x[1] where f(x) = 0, we can minimize an error function and solve f(x) = 0 indirectly. The only difference from the one-dimensional case is that we differentiate with respect to x[0] and x[1] separately. In mathematics these are called partial derivatives:

Hold x[1] fixed and differentiate with respect to x[0]: this gives the partial derivative of f(x) with respect to x[0].
Hold x[0] fixed and differentiate with respect to x[1]: this gives the partial derivative of f(x) with respect to x[1].
With that understanding, we define gradient_descent as follows:

def gradient_descent(x):
    delta = 0.00000001
    # partial derivative with respect to x[0]: x[1] is held fixed
    derivative_x0 = (error([x[0] + delta, x[1]]) - error([x[0], x[1]])) / delta
    # partial derivative with respect to x[1]: x[0] is held fixed
    derivative_x1 = (error([x[0], x[1] + delta]) - error([x[0], x[1]])) / delta
    rate = 0.02
    x[0] = x[0] - rate * derivative_x0
    x[1] = x[1] - rate * derivative_x1
    return [x[0], x[1]]

rate plays the same role as before; the only difference is that x[0] and x[1] are each updated with their own partial derivative. Here is the full program:

def f(x):
    return x[0] + 2 * x[1] + 4


def error(x):
    return (f(x) - 0) ** 2


def gradient_descent(x):
    delta = 0.00000001
    derivative_x0 = (error([x[0] + delta, x[1]]) - error([x[0], x[1]])) / delta
    derivative_x1 = (error([x[0], x[1] + delta]) - error([x[0], x[1]])) / delta
    rate = 0.02
    x[0] = x[0] - rate * derivative_x0
    x[1] = x[1] - rate * derivative_x1
    return [x[0], x[1]]


x = [-0.5, -1.0]
for i in range(100):
    x = gradient_descent(x)
    print('x = {:6f},{:6f}, f(x) = {:6f}'.format(x[0], x[1], f(x)))

The output is:

x = -0.560000,-1.120000, f(x) = 1.200000
x = -0.608000,-1.216000, f(x) = 0.960000
x = -0.646400,-1.292800, f(x) = 0.768000
x = -0.677120,-1.354240, f(x) = 0.614400
x = -0.701696,-1.403392, f(x) = 0.491520
x = -0.721357,-1.442714, f(x) = 0.393216
x = -0.737085,-1.474171, f(x) = 0.314573
x = -0.749668,-1.499337, f(x) = 0.251658
x = -0.759735,-1.519469, f(x) = 0.201327
x = -0.767788,-1.535575, f(x) = 0.161061
x = -0.774230,-1.548460, f(x) = 0.128849
x = -0.779384,-1.558768, f(x) = 0.103079
x = -0.783507,-1.567015, f(x) = 0.082463
x = -0.786806,-1.573612, f(x) = 0.065971
x = -0.789445,-1.578889, f(x) = 0.052777
x = -0.791556,-1.583112, f(x) = 0.042221
x = -0.793245,-1.586489, f(x) = 0.033777
x = -0.794596,-1.589191, f(x) = 0.027022
x = -0.795677,-1.591353, f(x) = 0.021617
x = -0.796541,-1.593082, f(x) = 0.017294
x = -0.797233,-1.594466, f(x) = 0.013835
x = -0.797786,-1.595573, f(x) = 0.011068
x = -0.798229,-1.596458, f(x) = 0.008854
x = -0.798583,-1.597167, f(x) = 0.007084
x = -0.798867,-1.597733, f(x) = 0.005667
x = -0.799093,-1.598187, f(x) = 0.004533
x = -0.799275,-1.598549, f(x) = 0.003627
x = -0.799420,-1.598839, f(x) = 0.002901
x = -0.799536,-1.599072, f(x) = 0.002321
x = -0.799629,-1.599257, f(x) = 0.001857
x = -0.799703,-1.599406, f(x) = 0.001486
x = -0.799762,-1.599525, f(x) = 0.001188
x = -0.799810,-1.599620, f(x) = 0.000951
x = -0.799848,-1.599696, f(x) = 0.000761
x = -0.799878,-1.599757, f(x) = 0.000608
x = -0.799903,-1.599805, f(x) = 0.000487
x = -0.799922,-1.599844, f(x) = 0.000389
x = -0.799938,-1.599875, f(x) = 0.000312
x = -0.799950,-1.599900, f(x) = 0.000249
x = -0.799960,-1.599920, f(x) = 0.000199
x = -0.799968,-1.599936, f(x) = 0.000159
x = -0.799974,-1.599949, f(x) = 0.000128
x = -0.799980,-1.599959, f(x) = 0.000102
x = -0.799984,-1.599967, f(x) = 0.000082
x = -0.799987,-1.599974, f(x) = 0.000065
x = -0.799990,-1.599979, f(x) = 0.000052
x = -0.799992,-1.599983, f(x) = 0.000042
x = -0.799993,-1.599987, f(x) = 0.000033
x = -0.799995,-1.599989, f(x) = 0.000027
x = -0.799996,-1.599991, f(x) = 0.000021
x = -0.799997,-1.599993, f(x) = 0.000017
x = -0.799997,-1.599995, f(x) = 0.000014
x = -0.799998,-1.599996, f(x) = 0.000011
x = -0.799998,-1.599997, f(x) = 0.000009
x = -0.799999,-1.599997, f(x) = 0.000007
x = -0.799999,-1.599998, f(x) = 0.000006
x = -0.799999,-1.599998, f(x) = 0.000004
x = -0.799999,-1.599999, f(x) = 0.000004
x = -0.799999,-1.599999, f(x) = 0.000003
x = -0.800000,-1.599999, f(x) = 0.000002
x = -0.800000,-1.599999, f(x) = 0.000002
x = -0.800000,-1.599999, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000001
x = -0.800000,-1.600000, f(x) = 0.000000

A careful reader will notice that f(x) = 0 has more than this one solution; for example x = [-2, -1] also satisfies it. This is because gradient descent only descends into whatever valley the current point happens to sit in, and the error function does not have just one valley where f(x) = 0. So gradient descent finds a local solution, not necessarily every solution; which one it reaches depends on the starting point. For very complicated functions, though, even a local solution is already quite valuable.
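To see this dependence on the starting point, here is a small sketch (the same code as above, with only the starting point and iteration count changed by me): starting elsewhere lands on a different solution of f(x) = 0.

def f(x):
    return x[0] + 2 * x[1] + 4

def error(x):
    return (f(x) - 0) ** 2

def gradient_descent(x):
    delta = 0.00000001
    derivative_x0 = (error([x[0] + delta, x[1]]) - error([x[0], x[1]])) / delta
    derivative_x1 = (error([x[0], x[1] + delta]) - error([x[0], x[1]])) / delta
    rate = 0.02
    return [x[0] - rate * derivative_x0, x[1] - rate * derivative_x1]

x = [-1.5, -0.5]                     # a different starting point
for i in range(200):
    x = gradient_descent(x)
print('x = {:6f},{:6f}, f(x) = {:6f}'.format(x[0], x[1], f(x)))
# converges to roughly x = [-1.8, -1.1], another point where f(x) = 0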

Application in TensorFlow

With these examples, you should now have a basic understanding of gradient descent. Let's return to where we started and use gradient descent in TensorFlow, via GradientDescentOptimizer.

import tensorflow as tf
 
# Model parameters
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
# Model input and output
x = tf.placeholder(tf.float32)
linear_model = W*x + b
y = tf.placeholder(tf.float32)
 
# loss
loss = tf.reduce_sum(tf.square(linear_model - y)) # sum of the squares
# optimizer
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
 
# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
# training loop
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init) # reset values to wrong
for i in range(1000):
  sess.run(train, {x: x_train, y: y_train})
 
# evaluate training accuracy
curr_W, curr_b, curr_loss = sess.run([W, b, loss], {x: x_train, y: y_train})
print("W: %s b: %s loss: %s"%(curr_W, curr_b, curr_loss))

This is the example from the official TensorFlow site. The code defines the model linear_model = W * x + b, and the error for each sample is linear_model - y. The goal is to train on a set of x_train and y_train values to solve for W and b. To find the best fit for this data, the squared error of every sample is summed into loss, and gradient descent is then run on loss to find the optimal values.

optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

Here rate is 0.01. Since this example is also a multi-dimensional function (of W and b), partial derivatives are again used to step gradually toward the optimum.
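To connect this with the earlier sections, here is a hand-rolled sketch of roughly what the optimizer update does for this model (my own illustration; the analytic gradients below are standard calculus, not taken from the article or the TensorFlow source):

x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
W, b = 0.3, -0.3
rate = 0.01
for i in range(1000):
    # partial derivatives of loss = sum((W*x + b - y)**2) with respect to W and b
    dW = sum(2 * (W * x + b - y) * x for x, y in zip(x_train, y_train))
    db = sum(2 * (W * x + b - y) for x, y in zip(x_train, y_train))
    W, b = W - rate * dW, b - rate * db
print("W = {:.6f}, b = {:.6f}".format(W, b))    # close to W = -1, b = 1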

for i in range(1000):
  sess.run(train, {x: x_train, y: y_train})

Finally, gradient descent runs in a loop, and the result is shown below:

W: [-0.9999969] b: [0.9999908] loss: 5.6999738e-11

We will not re-derive this step by step here; after the "past and present" sections above, you should be able to work it out yourself. Looking at the final result, we can round it to W = -1.0 and b = 1.0. Substituting these into the loss above gives 0.0, the smallest possible loss, so W = -1.0 and b = 1.0 is the optimal solution for the x_train and y_train data.
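A quick check of that claim in plain Python (my own snippet, independent of the TensorFlow code):

x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
W, b = -1.0, 1.0
loss = sum((W * x + b - y) ** 2 for x, y in zip(x_train, y_train))
print(loss)                                    # 0.0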
