Computing gradients is a frequent task: nearly every learning algorithm uses them (see this summary of training algorithms). This article does not go into the various training algorithms themselves; instead it focuses on the subtask of gradient computation.
#Gradient Definition
For a function $f: R^n \rightarrow R$, $f(x_1, x_2, ..., x_n)$, its gradient is defined as:
$$\nabla f(x_1, x_2, ..., x_n) = \mathrm{grad}\ f(x_1, x_2, ..., x_n) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n}\right)$$
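As a quick worked example of this definition (the function here is my own illustration, not from the original): for $f(x_1, x_2) = x_1^2 + x_1 x_2$,
$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}\right) = (2x_1 + x_2,\ x_1)$$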
##Computing Derivatives
Computing a gradient involves computing derivatives of a function, and in general there are three main ways to compute the derivative of a function.
###Numerical Differentiation
This method computes the derivative directly from its definition:
$$\frac{\partial f}{\partial x}=f'(x)=\lim_{h\rightarrow 0}\frac{f(x+h)-f(x)}{h}$$
In practice we pick a small value of $h$, and the result computed this way approximates the true derivative. The advantage of this method is that it can compute a derivative even when we do not know what the function is. The disadvantages are just as obvious: how small does $h$ have to be to meet a given error requirement, and the various errors accumulated during the computation can make the final error fairly large.
Let us look at the relationship between the error and the choice of $h$ through an example. For the function $f = x^3$, the error is:
$$error=\left|\frac{f(x+h)-f(x)}{h} - 3x^2\right|$$
The horizontal axis shows $h$ and the vertical axis shows the error $error$.
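The curve is easy to reproduce. Below is a minimal sketch (the sampling of $h$ and the point $x = 1$ are my own choices): it evaluates the error above for a range of step sizes.

```python
# Forward-difference error for f(x) = x^3 at x = 1 over a range of h.
def f(x):
    return x ** 3

x = 1.0
true_grad = 3 * x ** 2  # analytic derivative of x^3

for h in [10.0 ** (-k) for k in range(1, 16)]:
    approx = (f(x + h) - f(x)) / h   # forward-difference quotient
    error = abs(approx - true_grad)  # the error defined above
    print(f"h = {h:.0e}  error = {error:.3e}")
```

The error first shrinks with $h$, is smallest around $h \approx 10^{-8}$, and then grows again once floating-point cancellation dominates, which is exactly the accumulation-of-errors problem mentioned above.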
###Symbolic Differentiation
This method requires knowing the function's expression first; the derivative's expression is then derived from the differentiation rules and used to compute the derivative's value.
For example, to differentiate the function $f = x\sin(x)$, the rules give:
$$\frac{\partial f}{\partial x}=\sin(x)+x\cos(x)$$
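This is what computer algebra systems automate. A minimal sketch with SymPy (my choice of library; the original does not name one):

```python
# Symbolic differentiation of f = x*sin(x) via SymPy's rule engine.
import sympy as sp

x = sp.symbols('x')
f = x * sp.sin(x)
df = sp.diff(f, x)  # applies the product rule symbolically
print(df)           # -> x*cos(x) + sin(x)
```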
As models grow larger, the functions arising in TF in practice can be extremely complex, and deriving an expression for the derivative becomes very difficult.
###Automatic Differentiation
Let us first look at how this approach relates to the other two:
Before describing the concrete implementation of auto differentiation in TF, let us review the chain rule:
Suppose we have differentiable functions $y=f(w_2), w_2=g(w_1), w_1=h(w_0), w_0=J(x)$; then:
$$\frac{dy}{dx}=\frac{dy}{dw_2}\frac{dw_2}{dw_1}\frac{dw_1}{dw_0}\frac{dw_0}{dx}$$
Suppose we have differentiable functions $z=f(x,y), x = g(t), y = h(t)$; then:
$$\frac{df}{dt}=\frac{\partial f}{\partial x}\frac{dx}{dt}+\frac{\partial f}{\partial y}\frac{dy}{dt}=\nabla f(x,y) \cdot \begin{bmatrix} \frac{dx}{dt} \\ \frac{dy}{dt} \end{bmatrix}$$
Suppose we have differentiable functions $z=f(x,y), x=g(s,t), y=h(s,t)$; then:
$$\nabla f(s,t)=\left[\frac{\partial f}{\partial s}, \frac{\partial f}{\partial t}\right]=\left[\frac{\partial f}{\partial x}\frac{\partial x}{\partial s}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial s},\ \frac{\partial f}{\partial x}\frac{\partial x}{\partial t}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial t}\right]=\nabla f(x,y)\cdot \begin{bmatrix} \frac{\partial x}{\partial s} & \frac{\partial x}{\partial t} \\ \frac{\partial y}{\partial s} & \frac{\partial y}{\partial t} \end{bmatrix}$$
More generally, suppose we have differentiable functions $y=f(x_1, x_2, ..., x_n)$ and $x_i = g_i(t_1, t_2, ..., t_n)$; then:
$$\frac{\partial f}{\partial t_i} = \sum_{j=1}^{n}\frac{\partial f}{\partial x_j}\frac{\partial x_j}{\partial t_i}$$
From this we obtain two ways of carrying out the derivative computation along a chain such as $x = w_0 \rightarrow w_1 \rightarrow w_2 \rightarrow w_3 = y$. Forward accumulation works from the input toward the output:
$$\frac{dw_i}{dx}=\frac{dw_i}{dw_{i-1}}\frac{dw_{i-1}}{dx}, \quad w_3=y$$
Reverse accumulation works from the output back toward the input:
$$\frac{dy}{dw_i}=\frac{dy}{dw_{i+1}}\frac{dw_{i+1}}{dw_i}, \quad w_0=x$$
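To make the two orders concrete, here is a minimal sketch for one specific chain (the functions $\sin$, square, and $\exp$ are my own illustration):

```python
# Chain: x = w0 -> w1 = sin(w0) -> w2 = w1**2 -> w3 = y = exp(w2).
import math

x = 0.5
w0 = x
w1 = math.sin(w0)
w2 = w1 ** 2
y = math.exp(w2)

# Local derivatives of each elementary step.
d1 = math.cos(w0)  # dw1/dw0
d2 = 2 * w1        # dw2/dw1
d3 = math.exp(w2)  # dy/dw2

# Forward accumulation: build dw_i/dx from input to output.
dw1_dx = d1 * 1.0
dw2_dx = d2 * dw1_dx
dy_dx_forward = d3 * dw2_dx

# Reverse accumulation: build dy/dw_i from output to input.
dy_dw2 = d3 * 1.0
dy_dw1 = dy_dw2 * d2
dy_dx_reverse = dy_dw1 * d1

print(dy_dx_forward, dy_dx_reverse)  # identical: dy/dx
```

Both orders yield the same $dy/dx$; reverse accumulation is the one used for gradients of many-input, scalar-output loss functions, since a single backward pass yields all partial derivatives.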
##Computing the Gradient
Now let us look at computing the gradient itself, taking an $R\rightarrow R$ function as an example. Suppose we have $L = z, z = z(y), y = y(x)$; the chain rule for derivatives gives:
grad L(z) = dL/dz = 1
grad L(y) = dL/dy = dL/dz * dz/dy = grad L(z) * dz/dy
grad L(x) = dL/dx = dL/dy * dy/dx = grad L(y) * dy/dx
This process is called backward propagation of the gradient. From the derivative formulas it is easy to derive the propagation formula for gradients of $R^m\rightarrow R^n$ functions; I will add that when I have time.
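TF's reverse-mode autodiff carries out exactly this accumulation. A minimal sketch with tf.GradientTape (the concrete choices $y = \sin(x)$ and $z = y^2$ are mine):

```python
# Reverse-mode gradient of L = z, z = y*y, y = sin(x) at x = 0.5.
import tensorflow as tf

x = tf.constant(0.5)
with tf.GradientTape() as tape:
    tape.watch(x)  # x is a constant, so it must be watched explicitly
    y = tf.sin(x)  # y = y(x)
    z = y * y      # z = z(y)
    L = z          # L = z
print(tape.gradient(L, x).numpy())  # = 2*sin(0.5)*cos(0.5), i.e. grad L(y) * dy/dx
```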
##Computing Gradients of Complex Functions
The functions discussed so far have all been real. Many models are now complex-valued, and TF supports the complex64 and complex128 variable types. For complex functions the question of computing gradients also arises, and the way TF computes gradients of complex functions is easy to misunderstand. For example, consider this question:
The gradient problem for complex functions
I explained this in some detail in my reply to that question; let me restate it here.
Again let us use the simplest kind of function, $C\rightarrow C: z = f(w)$, as the example. For a complex function, if we carry over the way the gradient is defined for real functions, the chain-rule computation above no longer applies. To make the chain rule work, we define the gradient of a complex function as follows:
grad_real L(w) = dL_real/dw_real + i * dL_real/dw_imag
grad_imag L(w) = dL_imag/dw_real + i * dL_imag/dw_imag
With this definition, let us see how the gradient of a complex function propagates:
Suppose we have a complex function: $L = z, z = z(y), y = y(x)$, where $L, z, y, x$ are all complex.
grad_real L(z) = dz_real/dz_real + i * dz_real/dz_imag = 1 + 0i = 1
grad_imag L(z) = dz_imag/dz_real + i * dz_imag/dz_imag = 0 + 1i = i
grad_real L(y) = dz_real/dy_real + i * dz_real/dy_imag
= 1 * (dz_real/dy_real + i * dz_real/dy_imag)
And we know that for an analytic function z = z(y), we get the Cauchy–Riemann equations:
dz_real/dy_real = dz_imag/dy_imag
dz_real/dy_imag = - dz_imag/dy_real
We also know that:
dz/dy = dz_real/dy_real + i * dz_imag/dy_real
= dz_real/dy_real - i * dz_real/dy_imag
= dz_imag/dy_imag + i * dz_imag/dy_real
= dz_imag/dy_imag - i * dz_real/dy_imag
So,
grad_real L(y) = 1 * (dz_real/dy_real + i * dz_real/dy_imag)
= grad_real L(z) * conjugate(dz_real/dy_real - i * dz_real/dy_imag)
= grad_real L(z) * conjugate(dz/dy)
grad_imag L(y) = dz_imag/dy_real + i * dz_imag/dy_imag
= i * (dz_imag/dy_imag - i * dz_imag/dy_real)
= grad_imag L(z) * conjugate(dz_imag/dy_imag + i * dz_imag/dy_real)
= grad_imag L(z) * conjugate(dz/dy)
grad_real L(x) = dz_real/dx_real + i * dz_real/dx_imag
= (dz_real/dy_real * dy_real/dx_real + dz_real/dy_imag * dy_imag/dx_real) + i * (dz_real/dy_real*dy_real/dx_imag + dz_real/dy_imag * dy_imag/dx_imag)
= dz_real/dy_real * (dy_real/dx_real + i * dy_real/dx_imag) + dz_real/dy_imag * (dy_imag/dx_real + i * dy_imag/dx_imag)
= dz_real/dy_real * (dy_real/dx_real + i * dy_real/dx_imag) + i * dz_real/dy_imag * (dy_imag/dx_imag - i * dy_imag/dx_real)
= dz_real/dy_real * conjugate(dy_real/dx_real - i * dy_real/dx_imag) + i * dz_real/dy_imag * conjugate(dy_imag/dx_imag + i * dy_imag/dx_real)
= dz_real/dy_real * conjugate(dy/dx) + i * dz_real/dy_imag * conjugate(dy/dx)
= (dz_real/dy_real + i * dz_real/dy_imag) * conjugate(dy/dx)
= grad_real L(y) * conjugate(dy/dx)
grad_imag L(x) = dz_imag/dx_real + i * dz_imag/dx_imag
= (dz_imag/dy_real * dy_real/dx_real + dz_imag/dy_imag * dy_imag/dx_real) + i * (dz_imag/dy_real * dy_real/dx_imag + dz_imag/dy_imag * dy_imag/dx_imag)
= dz_imag/dy_real * (dy_real/dx_real + i * dy_real/dx_imag) + dz_imag/dy_imag * (dy_imag/dx_real + i * dy_imag/dx_imag)
= dz_imag/dy_real * (dy_real/dx_real + i * dy_real/dx_imag) + i * dz_imag/dy_imag * (dy_imag/dx_imag - i * dy_imag/dx_real)
= dz_imag/dy_real * conjugate(dy_real/dx_real - i * dy_real/dx_imag) + i * dz_imag/dy_imag * conjugate(dy_imag/dx_imag + i * dy_imag/dx_real)
= dz_imag/dy_real * conjugate(dy/dx) + i * dz_imag/dy_imag * conjugate(dy/dx)
= (dz_imag/dy_real + i * dz_imag/dy_imag) * conjugate(dy/dx)
= grad_imag L(y) * conjugate(dy/dx)
Since the complex conjugate of a real number is always the number itself, the gradient formulas above for real functions and complex functions can be merged into the following form:
Grad L(y) = Grad L(z) * conjugate(dz/dy)
Grad L(x) = Grad L(y) * conjugate(dy/dx)
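This merged rule is easy to check numerically. A minimal sketch, assuming TF's complex gradients follow the conjugate convention derived above (the test function $z = x^2$ is my own choice):

```python
# For L = z = x*x (holomorphic, dz/dx = 2x), the rule above gives
# Grad L(x) = Grad L(z) * conjugate(dz/dx) = conjugate(2x).
import tensorflow as tf

x = tf.constant(1.0 + 1.0j, dtype=tf.complex128)
with tf.GradientTape() as tape:
    tape.watch(x)
    z = x * x
print(tape.gradient(z, x).numpy())  # (2-2j) == conjugate(2 * (1+1j))
```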
##Implementation in TF
TF registers a corresponding gradient function for every basic operation whose gradient may be needed, and it implements automatic gradient computation by automatically inserting these gradient functions.
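The public tf.custom_gradient API exposes the same idea to user code, so it gives a feel for what a registered gradient function looks like: it receives the upstream gradient (grad L of the op's output) and returns grad L of the op's inputs. my_square below is a hypothetical example op, not part of TF:

```python
import tensorflow as tf

@tf.custom_gradient
def my_square(x):
    y = x * x
    def grad_fn(upstream):       # upstream = grad L(y)
        return upstream * 2 * x  # grad L(x) = grad L(y) * dy/dx
    return y, grad_fn

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    L = my_square(x)
print(tape.gradient(L, x).numpy())  # 6.0
```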