Neural Machine Translation Notes 3, Extension A: Matrix Calculus Foundations for Deep Learning

Table of Contents

  • Neural Machine Translation Notes 3, Extension A: Matrix Calculus Foundations for Deep Learning
    • Prerequisites
    • Matrix Calculus
      • Generalizing the Jacobian
      • Derivatives of element-wise operations on two vectors
      • Derivatives of vector-scalar operations
      • Sum reduction of a vector
      • The chain rule
        • Single-variable chain rule
        • Single-variable total-derivative chain rule
        • Vector chain rule
    • Gradient of the activation function
    • Gradient of the neural network loss function

Neural Machine Translation Notes 3, Extension A: Matrix Calculus Foundations for Deep Learning

A note before we begin: matrix calculus is one of the mathematical foundations of deep learning, yet it receives almost no coverage in undergraduate computer science (and related non-math) programs, so learning it is mostly a matter of self-study. The three best resources I had read before are the Wikipedia entry Matrix Calculus, The Matrix Cookbook, and Towser's article "Matrix/Vector Derivatives in Machine Learning" (《机器学习中的矩阵/向量求导》). The first two are in English and focus mainly on results; they work well as reference dictionaries, but after reading them I was left knowing the what without the why (the Wikipedia entry does sketch some derivations, though rather briefly). Towser's article is very well written, but my math is weak and I still got stuck once it reached the tensor material.

While tidying my Weibo bookmarks these past few days, I stumbled on an article once recommended by Professor Chen Guang of BUPT (爱可可爱生活 on Weibo): The Matrix Calculus You Need For Deep Learning by Terence Parr and Jeremy Howard. It fits my needs well (proving once again that bookmarks you never revisit are bookmarks wasted) and is somewhat more elementary than Towser's article. This post is a digest of my reading of that paper.

Prerequisites

For derivatives of single-variable functions, the following rules hold (throughout, $x$ is the independent variable):

  • Constant rule: $f(x) = c \rightarrow df/dx = 0$
  • Constant multiple rule: $(cf(x))' = c\frac{df}{dx}$
  • Power rule: $f(x) = x^n \rightarrow \frac{df}{dx} = nx^{n-1}$
  • Sum rule: $\frac{d}{dx}(f(x) + g(x)) = \frac{df}{dx} + \frac{dg}{dx}$
  • Difference rule: $\frac{d}{dx}(f(x) - g(x)) = \frac{df}{dx} - \frac{dg}{dx}$
  • Product rule: $\frac{d}{dx}(f(x)\cdot g(x)) = f(x)\cdot \frac{dg}{dx} + \frac{df}{dx}\cdot g(x)$
  • Chain rule: $\frac{d}{dx}(f(g(x))) = \frac{df(u)}{du}\cdot \frac{du}{dx}$, where $u=g(x)$

For functions of two variables we need the notion of a partial derivative. Given a function $f(x,y)$, the partial derivative with respect to $x$ or $y$ is obtained by treating the other variable as a constant (for a multivariate function, all other variables are held constant when differentiating with respect to one of them). The partial derivatives can be collected into the gradient, written $\nabla f(x,y)$:
$$\nabla f(x,y) = \left[\begin{matrix}\frac{\partial f(x,y)}{\partial x} \\ \frac{\partial f(x,y)}{\partial y}\end{matrix}\right]$$
(Note: the original paper writes the gradient as a row vector and states that it uses numerator layout. That convention, however, gives the derivative of a scalar with respect to a vector a shape different from the vector itself, and it also departs from the mainstream notation. These notes therefore convert everything to the more common denominator layout.)
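As a small concrete case, take $f(x,y) = 3x^2y$, whose gradient is $[6xy,\ 3x^2]^\mathsf{T}$. The sketch below (a quick numpy check with arbitrary values, added here for illustration) verifies it by central differences:

```python
import numpy as np

def f(x, y):
    return 3 * x**2 * y  # example scalar function of two variables

x, y, eps = 2.0, -1.0, 1e-6
# central-difference estimates of the two partial derivatives
grad = np.array([(f(x + eps, y) - f(x - eps, y)) / (2 * eps),
                 (f(x, y + eps) - f(x, y - eps)) / (2 * eps)])
print(np.allclose(grad, [6 * x * y, 3 * x**2]))  # True
```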

Matrix Calculus

The partial derivatives of one function with respect to its variables form the gradient; the partial derivatives of several functions with respect to those variables can be assembled into a matrix, called the Jacobian matrix. For example, given two functions $f$ and $g$, their gradients combine into
$$J = \left[\begin{matrix}\nabla^\mathsf{T} f(x,y) \\ \nabla^\mathsf{T} g(x, y)\end{matrix}\right]$$

Generalizing the Jacobian

Multiple variables can be packed into a single vector: $f(x_1, x_2, \ldots, x_n) = f(\boldsymbol{x})$ (in this post, all vectors are taken to be $n \times 1$ column vectors, i.e.
$$\boldsymbol{x} = \left[\begin{matrix}x_1 \\ x_2 \\ \vdots \\ x_n\end{matrix}\right]$$
). Suppose there are $m$ functions, each mapping the vector $\boldsymbol{x}$ to a scalar:
$$\begin{aligned} y_1 &= f_1(\boldsymbol{x}) \\ y_2 &= f_2(\boldsymbol{x}) \\ &\vdots \\ y_m &= f_m(\boldsymbol{x}) \end{aligned}$$
This can be abbreviated as
$$\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x})$$
The derivative of $\boldsymbol{y}$ with respect to $\boldsymbol{x}$ is the Jacobian matrix obtained by stacking the derivative of each function with respect to $\boldsymbol{x}$:
$$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \left[\begin{matrix}\frac{\partial}{\partial x_1}f_1(\boldsymbol{x}) & \frac{\partial}{\partial x_2}f_1(\boldsymbol{x}) & \cdots & \frac{\partial}{\partial x_n}f_1(\boldsymbol{x})\\ \frac{\partial}{\partial x_1}f_2(\boldsymbol{x}) & \frac{\partial}{\partial x_2}f_2(\boldsymbol{x}) & \cdots & \frac{\partial}{\partial x_n}f_2(\boldsymbol{x})\\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial}{\partial x_1}f_m(\boldsymbol{x}) & \frac{\partial}{\partial x_2}f_m(\boldsymbol{x}) & \cdots & \frac{\partial}{\partial x_n}f_m(\boldsymbol{x})\\ \end{matrix}\right]$$
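A quick numerical check makes the $m \times n$ shape concrete. The sketch below is my own illustration rather than anything from the paper; the toy function f and the finite-difference helper numerical_jacobian are invented for the purpose:

```python
import numpy as np

def f(x):
    # toy vector function f: R^3 -> R^2
    return np.array([x[0] * x[1], x[1] + np.sin(x[2])])

def numerical_jacobian(f, x, eps=1e-6):
    # estimate the m x n Jacobian of f at x by central differences
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)  # column j holds ∂f/∂x_j
    return J

x = np.array([1.0, 2.0, 0.5])
# analytic Jacobian: row i is the transposed gradient of f_i
J_analytic = np.array([[x[1], x[0], 0.0],
                       [0.0,  1.0, np.cos(x[2])]])
print(np.allclose(numerical_jacobian(f, x), J_analytic))  # True
```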

Derivatives of element-wise operations on two vectors

Let $\bigcirc$ be an operator that combines two vectors element-wise (for example, $\bigoplus$ denotes vector addition, adding two vectors element by element). For $\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{w}) \bigcirc \boldsymbol{g}(\boldsymbol{x})$, assuming $n = m = |\boldsymbol{y}| = |\boldsymbol{w}| = |\boldsymbol{x}|$, we can expand as
$$\left[\begin{matrix}y_1 \\ y_2 \\ \vdots \\ y_n\end{matrix}\right] = \left[\begin{matrix} f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x}) \\ f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x}) \\ \vdots \\ f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})\end{matrix}\right]$$
Differentiating $\boldsymbol{y}$ with respect to $\boldsymbol{w}$ and $\boldsymbol{x}$ yields two square matrices:
$$J_{\boldsymbol{w}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{w}} = \left[\begin{matrix}\frac{\partial }{\partial w_1}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \frac{\partial }{\partial w_2}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial w_n}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) \\ \frac{\partial }{\partial w_1}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \frac{\partial }{\partial w_2}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial w_n}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial }{\partial w_1}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \frac{\partial }{\partial w_2}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial w_n}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) \\ \end{matrix}\right]$$

$$J_{\boldsymbol{x}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \left[\begin{matrix}\frac{\partial }{\partial x_1}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \frac{\partial }{\partial x_2}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial x_n}(f_1(\boldsymbol{w}) \bigcirc g_1(\boldsymbol{x})) \\ \frac{\partial }{\partial x_1}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \frac{\partial }{\partial x_2}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial x_n}(f_2(\boldsymbol{w}) \bigcirc g_2(\boldsymbol{x})) \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial }{\partial x_1}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \frac{\partial }{\partial x_2}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) & \cdots & \frac{\partial }{\partial x_n}(f_n(\boldsymbol{w}) \bigcirc g_n(\boldsymbol{x})) \\ \end{matrix}\right]$$

Since $\bigcirc$ operates element-wise, $f_i$ is a function of $w_i$ alone and $g_i$ is a function of $x_i$ alone. Hence $\frac{\partial }{\partial w_j}f_i(w_i) = \frac{\partial }{\partial w_j}g_i(x_i) = 0$ whenever $j \ne i$, and since $0 \bigcirc 0 = 0$, both Jacobians above are diagonal and can be abbreviated as
$$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{w}} = {\rm diag}\left(\frac{\partial}{\partial w_1}(f_1(w_1) \bigcirc g_1(x_1)), \frac{\partial}{\partial w_2}(f_2(w_2) \bigcirc g_2(x_2)), \ldots, \frac{\partial}{\partial w_n}(f_n(w_n) \bigcirc g_n(x_n))\right)$$

$$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = {\rm diag}\left(\frac{\partial}{\partial x_1}(f_1(w_1) \bigcirc g_1(x_1)), \frac{\partial}{\partial x_2}(f_2(w_2) \bigcirc g_2(x_2)), \ldots, \frac{\partial}{\partial x_n}(f_n(w_n) \bigcirc g_n(x_n))\right)$$

Considering only the simplest vector computation, let $\boldsymbol{f}(\boldsymbol{w}) = \boldsymbol{w}$ and $\boldsymbol{g}(\boldsymbol{x}) = \boldsymbol{x}$, so that $f_i(w_i) = w_i$. If $\bigcirc$ is element-wise addition, then
$$\frac{\partial }{\partial w_i}(f_i(w_i) + g_i(x_i)) = 1 = \frac{\partial}{\partial x_i}(f_i(w_i) + g_i(x_i))$$
Subtraction, multiplication, and division work analogously, yielding the identities below (since element-wise vector addition $\oplus$ follows exactly the rule of $+$, we write $\oplus$ simply as $+$; likewise for subtraction):
$$\begin{aligned} \frac{\partial (\boldsymbol{w} + \boldsymbol{x})}{\partial \boldsymbol{w}} &= \boldsymbol{I} = \frac{\partial (\boldsymbol{w} + \boldsymbol{x})}{\partial \boldsymbol{x}}\\ \frac{\partial (\boldsymbol{w} - \boldsymbol{x})}{\partial \boldsymbol{w}} &= \boldsymbol{I} \\ \frac{\partial (\boldsymbol{w} - \boldsymbol{x})}{\partial \boldsymbol{x}} &= -\boldsymbol{I} \\ \frac{\partial (\boldsymbol{w} \otimes \boldsymbol{x})}{\partial \boldsymbol{w}} &= {\rm diag}(\boldsymbol{x}) \\ \frac{\partial (\boldsymbol{w} \otimes \boldsymbol{x})}{\partial \boldsymbol{x}} &= {\rm diag}(\boldsymbol{w}) \\ \frac{\partial (\boldsymbol{w} \oslash \boldsymbol{x})}{\partial \boldsymbol{w}} &= {\rm diag}\left(\cdots \frac{1}{x_i}\cdots\right)\\ \frac{\partial (\boldsymbol{w} \oslash \boldsymbol{x})}{\partial \boldsymbol{x}} &= {\rm diag}\left(\cdots -\frac{w_i}{x_i^2}\cdots\right) \end{aligned}$$
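These diagonal Jacobians are the reason element-wise operations are cheap in backpropagation. The sketch below is my own illustration (the name upstream stands for a hypothetical gradient flowing back from later layers): multiplying by ${\rm diag}(\boldsymbol{x})$ reduces to an element-wise product, so the full matrix never needs to be built.

```python
import numpy as np

rng = np.random.default_rng(0)
w, x = rng.normal(size=5), rng.normal(size=5)
upstream = rng.normal(size=5)  # stand-in for a gradient arriving from later layers

# full-Jacobian view of y = w ⊗ x: J_w = diag(x)
J_w = np.diag(x)
vjp_full = J_w.T @ upstream

# what the diagonal structure buys: a vector-Jacobian product
# is just an element-wise multiply, no matrix is materialized
vjp_cheap = upstream * x

print(np.allclose(vjp_full, vjp_cheap))  # True
```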

Derivatives of vector-scalar operations

For an operation between a vector and a scalar, we can broadcast the scalar into a vector of the same dimension and then apply the element-wise machinery above. The scalar is typically expanded by multiplying it with an all-ones vector: computing $\boldsymbol{y} = \boldsymbol{x} + z$ is really computing $\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x}) + \boldsymbol{g}(z)$, where $\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{x}$ and $\boldsymbol{g}(z) = \boldsymbol{1}z$.

This gives
$$\begin{aligned} \frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{x} + z) &= \boldsymbol{I} \\ \frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{x}z) &= \boldsymbol{I}z \\ \frac{\partial}{\partial z}(\boldsymbol{x} + z) &= \boldsymbol{1} \\ \frac{\partial}{\partial z}(\boldsymbol{x}z) &= \boldsymbol{x} \end{aligned}$$
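As a quick check of the last identity (again a numpy sketch of my own, with arbitrary values):

```python
import numpy as np

x, z, eps = np.array([1.0, -2.0, 3.0]), 0.7, 1e-6
# central difference of x*z with respect to the scalar z
dz = (x * (z + eps) - x * (z - eps)) / (2 * eps)
print(np.allclose(dz, x))  # True: ∂(xz)/∂z = x
```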

Sum reduction of a vector

Sum reduction ("sum reduce" in the original paper) simply adds up all the elements of a vector.

Let $y = {\rm sum}(\boldsymbol{f}(\boldsymbol{x})) = \sum_{i=1}^n f_i(\boldsymbol{x})$. Expanding,
$$\begin{aligned} \frac{\partial y}{\partial \boldsymbol{x}} &= \left[\begin{matrix}\frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{matrix}\right]^\mathsf{T} \\ &= \left[\begin{matrix}\frac{\partial }{\partial x_1}\sum_i f_i(\boldsymbol{x}) & \frac{\partial }{\partial x_2}\sum_i f_i(\boldsymbol{x}) & \cdots & \frac{\partial }{\partial x_n}\sum_i f_i(\boldsymbol{x}) \end{matrix}\right]^\mathsf{T} \\ &= \left[\begin{matrix}\sum_i \frac{\partial f_i(\boldsymbol{x})}{\partial x_1} & \sum_i \frac{\partial f_i(\boldsymbol{x})}{\partial x_2} & \cdots & \sum_i \frac{\partial f_i(\boldsymbol{x})}{\partial x_n}\end{matrix}\right]^\mathsf{T} \end{aligned}$$
Consider the simplest case, $y={\rm sum}(\boldsymbol{x})$. By the earlier discussion, $f_i(\boldsymbol{x}) = x_i$. Substituting this into the expression above, and noting that $\frac{\partial }{\partial x_j}x_i = 0$ for $i \ne j$, we readily obtain
$$y = {\rm sum}(\boldsymbol{x}) \rightarrow \nabla y = \boldsymbol{1}$$
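A short numerical confirmation (an illustrative sketch; the values are arbitrary):

```python
import numpy as np

x, eps = np.array([0.4, -1.3, 2.2, 0.9]), 1e-6
# gradient of sum(x) by central differences, one coordinate at a time
grad = np.array([(np.sum(x + eps * e) - np.sum(x - eps * e)) / (2 * eps)
                 for e in np.eye(x.size)])
print(np.allclose(grad, np.ones_like(x)))  # True: ∇ sum(x) = 1
```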

The chain rule

Differentiating a complex expression (one built by composing and nesting other expressions) requires the chain rule. The paper distinguishes three versions.

Single-variable chain rule

This is the rule from introductory calculus. Let $y = f(g(x))$ and set $u = g(x)$; then the single-variable chain rule is
$$\frac{dy}{dx} = \frac{dy}{du}\frac{du}{dx}$$

Single-variable total-derivative chain rule

In the simplest single-variable chain rule, every intermediate variable is a function of a single variable. For a function like $y = f(x) = x+x^2$, things get more complicated. Introducing intermediate variables $u_1$ and $u_2$,
$$\begin{aligned} u_1(x) &= x^2 \\ u_2(x, u_1) &= x + u_1 \qquad (y=f(x)=u_2(x, u_1)) \end{aligned}$$
Using only the single-variable chain rule, $du_2/du_1 = 1$ and $du_1/dx = 2x$, so $dy/dx = du_2/dx = du_2/du_1 \cdot du_1/dx = 2x$, which disagrees with the true answer, $2x+1$. If, since $u_2$ has two arguments, we bring in partial derivatives, then
$$\begin{aligned} \frac{\partial u_1(x)}{\partial x} &= 2x \\ \frac{\partial u_2(x, u_1)}{\partial u_1} &= 1 \end{aligned}$$
**Note that we cannot simply conclude $\partial u_2 / \partial x = 1$!** A partial derivative with respect to $x$ holds all other variables fixed as $x$ varies, but $u_1$ is in fact a function of $x$. We therefore need the single-variable total-derivative chain rule: if $u_1, \ldots, u_n$ all (possibly) depend on $x$, then
$$\frac{\partial f(x, u_1, \ldots, u_n)}{\partial x} = \frac{\partial f}{\partial x} + \sum_{i=1}^n \frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$
Plugging the example above into this formula,
$$\frac{\partial f(x, u_1)}{\partial x} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial u_1}\frac{\partial u_1}{\partial x} = 1 + 2x$$
Note that the total-derivative formula always sums partial-derivative terms. This is not because the example happens to contain an addition; it is because the formula expresses the total change in $y$ as the accumulated contribution of each variable that depends on $x$. Consider $y = f(x) = x \cdot x^2$; then
$$\begin{aligned} u_1(x) & = x^2 \\ u_2(x, u_1) & = xu_1 \\ \frac{\partial u_1}{\partial x} & = 2x \\ \frac{\partial u_2}{\partial x} & = u_1 \\ \frac{\partial u_2}{\partial u_1} & = x \end{aligned}$$
Applying the total-derivative formula,
$$\frac{dy}{dx} = \frac{\partial u_2}{\partial x} + \frac{\partial u_2}{\partial u_1}\frac{\partial u_1}{\partial x} = u_1 + x \cdot 2x = x^2 + 2x^2 = 3x^2$$
If we introduce one more variable $u_{n+1} = x$, the total-derivative chain rule above takes a more compact form:
$$\frac{\partial f(u_1, \ldots, u_{n+1})}{\partial x} = \sum_{i=1}^{n+1}\frac{\partial f}{\partial u_i}\frac{\partial u_i}{\partial x}$$
which looks very much like an inner product of two vectors:
$$\frac{\partial f}{\partial \boldsymbol{u}} \frac{\partial \boldsymbol{u}}{\partial x}$$
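The $y = x \cdot x^2$ example can be replayed numerically. The sketch below (my own illustration, mirroring the intermediate variables $u_1$ and $u_2$ above) accumulates the contribution of each path and checks the result against $3x^2$:

```python
import numpy as np

x = 1.7
u1 = x ** 2         # u1(x) = x^2
du1_dx = 2 * x      # ∂u1/∂x
du2_dx = u1         # ∂u2/∂x with u1 held fixed (u2 = x * u1)
du2_du1 = x         # ∂u2/∂u1
# total-derivative chain rule: sum over every path from x to y
dy_dx = du2_dx + du2_du1 * du1_dx
print(np.isclose(dy_dx, 3 * x ** 2))  # True
```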

Vector chain rule

To motivate the vector chain rule, start with an example. Let $\boldsymbol{y} = \boldsymbol{f}(x)$, where
$$\left[ \begin{matrix}y_1(x) \\ y_2(x)\end{matrix} \right] = \left[ \begin{matrix}f_1(x) \\ f_2(x)\end{matrix} \right] = \left[ \begin{matrix}\ln(x^2) \\ \sin (3x)\end{matrix} \right]$$
Introduce two intermediate variables $g_1$ and $g_2$ so that $\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{g}(x))$, where
$$\begin{aligned} \left[ \begin{matrix}g_1(x) \\ g_2(x)\end{matrix} \right] &= \left[ \begin{matrix}x^2 \\ 3x\end{matrix} \right] \\ \left[ \begin{matrix}f_1(\boldsymbol{g}) \\ f_2(\boldsymbol{g})\end{matrix} \right] &= \left[ \begin{matrix}\ln (g_1) \\ \sin (g_2)\end{matrix} \right] \end{aligned}$$
Then $\partial \boldsymbol{y}/\partial x$ follows from the total-derivative chain rule:
$$\frac{\partial \boldsymbol{y}}{\partial x} = \left[ \begin{matrix}\frac{\partial f_1(\boldsymbol{g})}{\partial x} \\ \frac{\partial f_2(\boldsymbol{g})}{\partial x}\end{matrix} \right] = \left[\begin{matrix}\frac{\partial f_1}{\partial g_1} \frac{\partial g_1}{\partial x} + \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x} \\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x}\end{matrix}\right]$$
This vector can be rewritten as a matrix-vector product:
$$\left[\begin{matrix}\frac{\partial f_1}{\partial g_1} \frac{\partial g_1}{\partial x} + \frac{\partial f_1}{\partial g_2}\frac{\partial g_2}{\partial x} \\ \frac{\partial f_2}{\partial g_1}\frac{\partial g_1}{\partial x} + \frac{\partial f_2}{\partial g_2}\frac{\partial g_2}{\partial x}\end{matrix}\right] = \left[\begin{matrix}\frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} \\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2}\end{matrix}\right]\left[\begin{matrix}\frac{\partial g_1}{\partial x} \\ \frac{\partial g_2}{\partial x}\end{matrix}\right]$$
The matrix here is exactly the Jacobian $\partial \boldsymbol{f}/\partial \boldsymbol{g}$, and the vector is likewise $\partial \boldsymbol{g}/\partial x$:
$$\left[\begin{matrix}\frac{\partial f_1}{\partial g_1} & \frac{\partial f_1}{\partial g_2} \\ \frac{\partial f_2}{\partial g_1} & \frac{\partial f_2}{\partial g_2}\end{matrix}\right]\left[\begin{matrix}\frac{\partial g_1}{\partial x} \\ \frac{\partial g_2}{\partial x}\end{matrix}\right] = \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{g}}\frac{\partial \boldsymbol{g}}{\partial x}$$
The same rule holds when $x$ is not a scalar but a vector $\boldsymbol{x}$, which gives us the vector chain rule:
$$\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{g}(\boldsymbol{x})) \rightarrow \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{g}}\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}}$$
In fact the formula often becomes simpler still. In many applications both Jacobians are diagonal square matrices: $f_i$ depends only on $g_i$, and $g_i$ depends only on $x_i$, i.e.
$$\begin{aligned} \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{g}} &= {\rm diag}\left(\frac{\partial f_i}{\partial g_i}\right) \\ \frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} &= {\rm diag}\left(\frac{\partial g_i}{\partial x_i}\right) \end{aligned}$$
In this case the vector chain rule simplifies to
$$\frac{\partial }{\partial \boldsymbol{x}}\boldsymbol{f}(\boldsymbol{g}(\boldsymbol{x})) = {\rm diag}\left(\frac{\partial f_i}{\partial g_i}\right){\rm diag}\left(\frac{\partial g_i}{\partial x_i}\right) = {\rm diag}\left(\frac{\partial f_i}{\partial g_i}\frac{\partial g_i}{\partial x_i}\right)$$
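To see the diagonal case in action, here is a sketch (my own example, not from the paper) using the element-wise composition $\boldsymbol{y} = \exp(\sin(\boldsymbol{x}))$, where both Jacobians are diagonal:

```python
import numpy as np

x = np.array([0.3, -1.2, 2.5])
g = np.sin(x)                    # inner function, element-wise
df_dg = np.diag(np.exp(g))       # diag(∂f_i/∂g_i) for f = exp
dg_dx = np.diag(np.cos(x))       # diag(∂g_i/∂x_i) for g = sin
J = df_dg @ dg_dx                # vector chain rule: product of Jacobians
# the product of two diagonal matrices is the diagonal of element-wise products
print(np.allclose(J, np.diag(np.exp(np.sin(x)) * np.cos(x))))  # True
```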

Gradient of the activation function

Assume a fully connected feed-forward network (so no convolutions, RNNs, etc.) with ReLU activations; a single unit computes
$${\rm activation}(\boldsymbol{x}) = \max(0, \boldsymbol{w} \cdot \boldsymbol{x} + b)$$
What we need are $\frac{\partial }{\partial \boldsymbol{w}}(\boldsymbol{w} \cdot \boldsymbol{x} + b)$ and $\frac{\partial }{\partial b}(\boldsymbol{w} \cdot \boldsymbol{x} + b)$. We have not yet discussed the partial derivative of a dot product with respect to a vector, but observe that
$$\boldsymbol{w} \cdot \boldsymbol{x} = \sum_{i}^n (w_ix_i) = {\rm sum}(\boldsymbol{w} \otimes \boldsymbol{x})$$
and we have already derived the partial derivatives of ${\rm sum}(\boldsymbol{x})$ and $\boldsymbol{w} \otimes \boldsymbol{x}$, along with the vector chain rule. So introduce intermediate variables
$$\begin{aligned} \boldsymbol{u} &= \boldsymbol{w} \otimes \boldsymbol{x} \\ y &= {\rm sum}(\boldsymbol{u}) \end{aligned}$$
By the derivations above,
$$\begin{aligned} \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{w}} &= {\rm diag}(\boldsymbol{x}) \\ \frac{\partial y}{\partial \boldsymbol{u}} &= \boldsymbol{1} \end{aligned}$$
Therefore
$$\frac{\partial y}{\partial \boldsymbol{w}} = \frac{\partial y}{\partial \boldsymbol{u}}\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{w}} = \boldsymbol{1}\cdot {\rm diag}(\boldsymbol{x}) = \boldsymbol{x}$$
Letting $y = \boldsymbol{w} \cdot \boldsymbol{x} + b$, it readily follows that
$$\frac{\partial y}{\partial \boldsymbol{w}} = \boldsymbol{x},\ \ \ \frac{\partial y}{\partial b} = 1$$
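A finite-difference check of both results (an illustrative numpy sketch with arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(0)
w, x, b, eps = rng.normal(size=4), rng.normal(size=4), 0.5, 1e-6

# ∂(w·x + b)/∂w should equal x
grad_w = np.array([(((w + eps * e) @ x + b) - ((w - eps * e) @ x + b)) / (2 * eps)
                   for e in np.eye(w.size)])
print(np.allclose(grad_w, x))  # True

# ∂(w·x + b)/∂b should equal 1
grad_b = ((w @ x + (b + eps)) - (w @ x + (b - eps))) / (2 * eps)
print(np.isclose(grad_b, 1.0))  # True
```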
The activation function is defined piecewise, and its derivative is discontinuous at 0. Differentiating each piece gives
$$\frac{\partial }{\partial z}\max(0, z) = \begin{cases}0 & z \le 0 \\ \frac{dz}{dz} = 1 & z > 0\end{cases}$$
Hence
$$\frac{\partial}{\partial \boldsymbol{x}}\max(0, \boldsymbol{x}) = \left[\begin{matrix}\frac{\partial }{\partial x_1}\max(0, x_1) \\ \frac{\partial }{\partial x_2}\max(0, x_2) \\ \vdots \\ \frac{\partial }{\partial x_n}\max(0, x_n)\end{matrix}\right]$$
Putting everything together for the activation function, introduce an intermediate variable $z$ for the affine transformation:
$$\begin{aligned} z(\boldsymbol{w}, b, \boldsymbol{x}) &= \boldsymbol{w} \cdot \boldsymbol{x} + b \\ {\rm activation}(z) &= \max(0, z) \end{aligned}$$
By the chain rule,
$$\frac{\partial {\rm activation}}{\partial \boldsymbol{w}} = \frac{\partial {\rm activation}}{\partial z} \frac{\partial z}{\partial \boldsymbol{w}}$$
Substituting the earlier results,
$$\begin{aligned} \frac{\partial {\rm activation}}{\partial \boldsymbol{w}} &= \begin{cases}0\frac{\partial z}{\partial \boldsymbol{w}} = \boldsymbol{0} & z \le 0\\ 1\frac{\partial z}{\partial \boldsymbol{w}} = \boldsymbol{x} & z > 0\end{cases} \\ \frac{\partial {\rm activation}}{\partial b} &= \begin{cases}0 & z \le 0 \\ 1 & z > 0 \end{cases} \end{aligned}$$
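The two cases translate directly into code. The sketch below (my own illustration) computes the analytic gradient of a single ReLU unit and validates it numerically, which works as long as $z$ is not exactly at the kink:

```python
import numpy as np

rng = np.random.default_rng(1)
w, x, b, eps = rng.normal(size=3), rng.normal(size=3), 0.1, 1e-6

z = w @ x + b
grad_w = x if z > 0 else np.zeros_like(w)   # ∂activation/∂w
grad_b = 1.0 if z > 0 else 0.0              # ∂activation/∂b

num_w = np.array([(max(0, (w + eps * e) @ x + b) - max(0, (w - eps * e) @ x + b)) / (2 * eps)
                  for e in np.eye(w.size)])
print(np.allclose(num_w, grad_w))  # True away from z = 0
```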

Gradient of the neural network loss function

Finally, consider a complete example. Let the model input be $\boldsymbol{X}$, with
$$\boldsymbol{X} = [\boldsymbol{x}_1,\boldsymbol{x}_2, \ldots, \boldsymbol{x}_N]^\mathsf{T}$$
where the number of samples is $N = |\boldsymbol{X}|$, and let the target vector be
$$\boldsymbol{y} = [{\rm target}(\boldsymbol{x}_1), {\rm target}(\boldsymbol{x}_2), \ldots, {\rm target}(\boldsymbol{x}_N)]^\mathsf{T}$$
where each $y_i$ is a scalar. With the squared-error loss, the loss function $C$ is
$$\begin{aligned} C(\boldsymbol{w}, b, \boldsymbol{X}, \boldsymbol{y}) &= \frac{1}{N}\sum_{i=1}^N(y_i - {\rm activation}(\boldsymbol{x}_i))^2 \\ &= \frac{1}{N}\sum_{i=1}^N(y_i - \max(0, \boldsymbol{w} \cdot \boldsymbol{x}_i + b))^2 \end{aligned}$$
Introducing intermediate variables,
$$\begin{aligned} u(\boldsymbol{w}, b, \boldsymbol{x}) &= \max(0, \boldsymbol{w} \cdot \boldsymbol{x} + b) \\ v(y, u) &= y -u\\ C(v) &= \frac{1}{N}\sum_{i=1}^N v^2 \end{aligned}$$
We focus here on the gradient with respect to the weights; the bias $b$ is handled the same way. From the earlier derivations,
$$\frac{\partial }{\partial \boldsymbol{w}} u(\boldsymbol{w}, b, \boldsymbol{x}) = \begin{cases}\boldsymbol{0} & \boldsymbol{w} \cdot \boldsymbol{x} + b \le 0 \\ \boldsymbol{x} & \boldsymbol{w}\cdot \boldsymbol{x} + b > 0\end{cases}$$

$$\frac{\partial v(y, u)}{\partial \boldsymbol{w}} = \frac{\partial }{\partial \boldsymbol{w}}(y - u) = \boldsymbol{0} - \frac{\partial u}{\partial \boldsymbol{w}} = -\frac{\partial u}{\partial \boldsymbol{w}} = \begin{cases}\boldsymbol{0} & \boldsymbol{w} \cdot \boldsymbol{x} + b \le 0 \\ -\boldsymbol{x} & \boldsymbol{w}\cdot \boldsymbol{x} + b > 0\end{cases}$$
The full gradient computation is therefore
$$\begin{aligned} \frac{\partial C(v)}{\partial \boldsymbol{w}} &= \frac{\partial}{\partial \boldsymbol{w}}\frac{1}{N}\sum_{i=1}^N v^2 \\ &= \frac{1}{N}\sum_{i=1}^N\frac{\partial v^2}{\partial v}\frac{\partial v}{\partial \boldsymbol{w}} \\ &= \frac{1}{N}\sum_{i=1}^N2v\frac{\partial v}{\partial \boldsymbol{w}} \end{aligned}$$
Skipping the intermediate expansion, we finally obtain
$$\frac{\partial C(v)}{\partial \boldsymbol{w}} = \begin{cases}\boldsymbol{0} & \boldsymbol{w}\cdot \boldsymbol{x}_i + b \le 0 \\ \frac{2}{N}\sum_{i=1}^N (\boldsymbol{w}\cdot \boldsymbol{x}_i + b - y_i)\boldsymbol{x}_i & \boldsymbol{w}\cdot \boldsymbol{x}_i + b > 0\end{cases}$$
where the case condition is applied per sample inside the sum. The paper's subsequent discussion of gradient descent is omitted here.
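To close, here is a sketch of the whole pipeline (my own illustration; the data, shapes, and random seed are arbitrary) that computes the loss gradient analytically, restricting the sum to the active samples, and checks it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 8, 3
X, y = rng.normal(size=(N, d)), rng.normal(size=N)
w, b = rng.normal(size=d), 0.2

def loss(w, b):
    u = np.maximum(0, X @ w + b)     # per-sample ReLU unit
    return np.mean((y - u) ** 2)     # squared-error loss C

# analytic gradient: only samples with w·x_i + b > 0 contribute
z = X @ w + b
active = z > 0
grad_w = (2 / N) * X[active].T @ (z[active] - y[active])

eps = 1e-6
num_w = np.array([(loss(w + eps * e, b) - loss(w - eps * e, b)) / (2 * eps)
                  for e in np.eye(d)])
print(np.allclose(num_w, grad_w))  # True
```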
