Numerator layout:
$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} & \cdots & \frac{\partial y}{\partial x_n} \end{bmatrix}$$
$$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} \\ \frac{\partial y_2}{\partial x} \\ \vdots \\ \frac{\partial y_m}{\partial x} \end{bmatrix}$$
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$
$$\frac{\partial y}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{p1}} \\ \frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{p2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial x_{1q}} & \frac{\partial y}{\partial x_{2q}} & \cdots & \frac{\partial y}{\partial x_{pq}} \end{bmatrix}$$
$$\frac{\partial \mathbf{Y}}{\partial x} = \begin{bmatrix} \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x} \\ \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} & \cdots & \frac{\partial y_{mn}}{\partial x} \end{bmatrix}$$
Matrix differential:
$$\mathrm{d}\mathbf{X} = \begin{bmatrix} \mathrm{d}x_{11} & \mathrm{d}x_{12} & \cdots & \mathrm{d}x_{1n} \\ \mathrm{d}x_{21} & \mathrm{d}x_{22} & \cdots & \mathrm{d}x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{d}x_{m1} & \mathrm{d}x_{m2} & \cdots & \mathrm{d}x_{mn} \end{bmatrix}$$
Denominator layout:
$$\frac{\partial y}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y}{\partial x_1} \\ \frac{\partial y}{\partial x_2} \\ \vdots \\ \frac{\partial y}{\partial x_n} \end{bmatrix}$$
$$\frac{\partial \mathbf{y}}{\partial x} = \begin{bmatrix} \frac{\partial y_1}{\partial x} & \frac{\partial y_2}{\partial x} & \cdots & \frac{\partial y_m}{\partial x} \end{bmatrix}$$
$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$
$$\frac{\partial y}{\partial \mathbf{X}} = \begin{bmatrix} \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} & \cdots & \frac{\partial y}{\partial x_{1q}} \\ \frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{2q}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial x_{p1}} & \frac{\partial y}{\partial x_{p2}} & \cdots & \frac{\partial y}{\partial x_{pq}} \end{bmatrix}$$
Derivation 1:
Let $v = v(\mathbf{x})$ be a scalar function and $\mathbf{u} = \mathbf{u}(\mathbf{x})$ a vector function. We want
$$\frac{\partial (v\mathbf{u})}{\partial \mathbf{x}}$$
Numerator layout:
$$\frac{\partial (v\mathbf{u})_i}{\partial x_j} = \frac{\partial (v u_i)}{\partial x_j} = \frac{\partial v}{\partial x_j} u_i + v \frac{\partial u_i}{\partial x_j} = u_i \left(\frac{\partial v}{\partial \mathbf{x}}\right)_j + v \left(\frac{\partial \mathbf{u}}{\partial \mathbf{x}}\right)_{ij}$$
Hence
$$\frac{\partial (v\mathbf{u})}{\partial \mathbf{x}} = \mathbf{u} \frac{\partial v}{\partial \mathbf{x}} + v \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$$
Denominator layout:
$$\frac{\partial (v\mathbf{u})_j}{\partial x_i} = \frac{\partial (v u_j)}{\partial x_i} = \frac{\partial v}{\partial x_i} u_j + v \frac{\partial u_j}{\partial x_i} = \left(\frac{\partial v}{\partial \mathbf{x}}\right)_i u_j + v \left(\frac{\partial \mathbf{u}}{\partial \mathbf{x}}\right)_{ij}$$
Hence
$$\frac{\partial (v\mathbf{u})}{\partial \mathbf{x}} = \frac{\partial v}{\partial \mathbf{x}} \mathbf{u}^T + v \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$$
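Both forms of the product rule can be sanity-checked with finite differences. The following is a minimal sketch assuming NumPy; the helper `jacobian_fd`, the matrix `A`, and the test functions `v` and `u` are hypothetical choices made only for illustration.

```python
import numpy as np

def jacobian_fd(f, x, eps=1e-6):
    """Numerator-layout Jacobian by central differences: J[i, j] = d f_i / d x_j."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        diff = np.atleast_1d(f(x + step)) - np.atleast_1d(f(x - step))
        cols.append(diff / (2 * eps))
    return np.stack(cols, axis=1)

# Hypothetical test functions: v is scalar-valued, u maps R^3 -> R^2.
A = np.array([[1.0, 2.0, 0.5], [0.0, -1.0, 3.0]])

def v(x):
    return np.sum(x ** 2)

def u(x):
    return np.sin(A @ x)

x0 = np.array([0.3, -0.7, 1.1])

lhs = jacobian_fd(lambda x: v(x) * u(x), x0)   # d(vu)/dx, shape (2, 3)
dv = jacobian_fd(v, x0)                        # row vector, shape (1, 3)
du = jacobian_fd(u, x0)                        # shape (2, 3)

# Numerator layout: u (dv/dx) + v (du/dx)
rhs_num = u(x0)[:, None] @ dv + v(x0) * du
# Denominator layout: (dv/dx) u^T + v (du/dx), every factor transposed
rhs_den = dv.T @ u(x0)[None, :] + v(x0) * du.T

print(np.allclose(lhs, rhs_num, atol=1e-6))
print(np.allclose(lhs.T, rhs_den, atol=1e-6))
```

The denominator-layout result is simply the transpose of the numerator-layout one, which is why the check compares `rhs_den` against `lhs.T`.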
Derivation 2:
Let $\mathbf{g}(\mathbf{u}): \mathbb{R}^n \to \mathbb{R}^n$ with $\mathbf{u} = \mathbf{u}(\mathbf{x})$.
Then
$$\frac{\partial g_i}{\partial x_j} = \sum_k \frac{\partial g_i}{\partial u_k} \frac{\partial u_k}{\partial x_j}$$
Numerator layout:
$$\frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \frac{\partial \mathbf{g}}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}$$
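For comparison, each denominator-layout derivative is the transpose of its numerator-layout counterpart, so transposing the product above reverses the order of the factors. With the same $\mathbf{g}$, $\mathbf{u}$, and $\mathbf{x}$, the denominator-layout chain rule reads:

$$\frac{\partial \mathbf{g}}{\partial \mathbf{x}} = \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \frac{\partial \mathbf{g}}{\partial \mathbf{u}}$$

When $g$ is a scalar, $\frac{\partial g}{\partial \mathbf{u}}$ is a column vector and the result is the usual gradient with respect to $\mathbf{x}$.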
Example:
Let $l = \|\mathbf{X}\mathbf{w} - \mathbf{y}\|^2$, where $\mathbf{X} \in \mathbb{R}^{m \times n}$, $\mathbf{w} \in \mathbb{R}^n$, and $\mathbf{y} \in \mathbb{R}^m$. Find $\frac{\partial l}{\partial \mathbf{w}}$.
Let $\mathbf{u} = \mathbf{X}\mathbf{w} - \mathbf{y}$, so that $l = \|\mathbf{u}\|^2$. By the chain rule in denominator layout,
$$\frac{\partial l}{\partial \mathbf{w}} = \frac{\partial \mathbf{u}}{\partial \mathbf{w}} \frac{\partial l}{\partial \mathbf{u}} = \mathbf{X}^T \cdot 2\mathbf{u} = 2\mathbf{X}^T\left(\mathbf{X}\mathbf{w} - \mathbf{y}\right)$$
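The least-squares gradient $2\mathbf{X}^T(\mathbf{X}\mathbf{w}-\mathbf{y})$ can be compared against a central-difference estimate. This is a small sketch assuming NumPy; the sizes `m`, `n`, the random data, and the helper `loss` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3                      # arbitrary sizes for the check
X = rng.normal(size=(m, n))
w = rng.normal(size=n)
y = rng.normal(size=m)           # y lives in R^m, matching X w

def loss(w_vec):
    """l = ||X w - y||^2"""
    r = X @ w_vec - y
    return r @ r

# Closed-form gradient derived above
analytic = 2 * X.T @ (X @ w - y)

# Central differences along each coordinate direction of w
eps = 1e-6
numeric = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(n)
])

print(np.allclose(analytic, numeric, atol=1e-5))
```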
Denominator layout:
$$\mathrm{d}f = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\partial f}{\partial x_{ij}} \mathrm{d}x_{ij} = \mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf{X}}\right)^T \mathrm{d}\mathbf{X}\right)$$
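As a short worked illustration of how this identity is used, take the hypothetical function $f = \mathbf{a}^T\mathbf{X}\mathbf{b}$ with constant vectors $\mathbf{a} \in \mathbb{R}^m$ and $\mathbf{b} \in \mathbb{R}^n$ (these symbols are introduced here only for the example). Since a scalar equals its own trace and the trace is invariant under cyclic permutation:

$$\mathrm{d}f = \mathbf{a}^T (\mathrm{d}\mathbf{X}) \mathbf{b} = \mathrm{tr}\left(\mathbf{a}^T (\mathrm{d}\mathbf{X}) \mathbf{b}\right) = \mathrm{tr}\left(\mathbf{b}\mathbf{a}^T \mathrm{d}\mathbf{X}\right) = \mathrm{tr}\left(\left(\mathbf{a}\mathbf{b}^T\right)^T \mathrm{d}\mathbf{X}\right)$$

Matching this against $\mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf{X}}\right)^T \mathrm{d}\mathbf{X}\right)$ gives $\frac{\partial f}{\partial \mathbf{X}} = \mathbf{a}\mathbf{b}^T$.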
Rules:
$$\mathrm{d}\left(\mathbf{X} \pm \mathbf{Y}\right) = \mathrm{d}\mathbf{X} \pm \mathrm{d}\mathbf{Y}$$
$$\mathrm{d}\left(\mathbf{X}\mathbf{Y}\right) = (\mathrm{d}\mathbf{X})\mathbf{Y} + \mathbf{X}\,\mathrm{d}\mathbf{Y}$$
$$\mathrm{d}\left(\mathbf{X}^T\right) = \left(\mathrm{d}\mathbf{X}\right)^T$$
$$\mathrm{d}\,\mathrm{tr}\left(\mathbf{X}\right) = \mathrm{tr}\left(\mathrm{d}\mathbf{X}\right)$$
$$\mathrm{d}\mathbf{X}^{-1} = -\mathbf{X}^{-1}\left(\mathrm{d}\mathbf{X}\right)\mathbf{X}^{-1}$$
$$\mathrm{d}\left|\mathbf{X}\right| = \mathrm{tr}\left(\mathbf{X}^{*}\,\mathrm{d}\mathbf{X}\right) = \left|\mathbf{X}\right| \mathrm{tr}\left(\mathbf{X}^{-1}\,\mathrm{d}\mathbf{X}\right)$$
$$\mathrm{d}\left(\mathbf{X} \odot \mathbf{Y}\right) = \mathrm{d}\mathbf{X} \odot \mathbf{Y} + \mathbf{X} \odot \mathrm{d}\mathbf{Y}$$
$$\mathrm{d}\sigma\left(\mathbf{X}\right) = \sigma'\left(\mathbf{X}\right) \odot \mathrm{d}\mathbf{X}$$
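The inverse and determinant rules are the least obvious of these, and a first-order numerical check makes them easier to trust. The sketch below assumes NumPy; the matrix `X`, its construction as a perturbed multiple of the identity (chosen only to keep it safely invertible), and the small perturbation `dX` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# Assumption: a perturbed multiple of the identity, so X is comfortably invertible.
X = 5.0 * np.eye(n) + 0.5 * rng.normal(size=(n, n))
dX = 1e-6 * rng.normal(size=(n, n))     # small perturbation standing in for dX

X_inv = np.linalg.inv(X)

# d(X^{-1}) = -X^{-1} (dX) X^{-1}
inv_change = np.linalg.inv(X + dX) - X_inv
inv_pred = -X_inv @ dX @ X_inv

# d|X| = |X| tr(X^{-1} dX)
det_change = np.linalg.det(X + dX) - np.linalg.det(X)
det_pred = np.linalg.det(X) * np.trace(X_inv @ dX)

# The rules are first-order, so agreement holds up to terms of order |dX|^2.
print(np.allclose(inv_change, inv_pred, rtol=1e-4, atol=1e-11))
print(np.isclose(det_change, det_pred, rtol=1e-3, atol=1e-9))
```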
Tricks:
$$x = \mathrm{tr}\left(x\right) \quad \text{for a scalar } x$$
$$\mathrm{tr}\left(\mathbf{X}^T\right) = \mathrm{tr}\left(\mathbf{X}\right)$$
$$\mathrm{tr}\left(\mathbf{X} \pm \mathbf{Y}\right) = \mathrm{tr}\left(\mathbf{X}\right) \pm \mathrm{tr}\left(\mathbf{Y}\right)$$
$$\mathrm{tr}\left(\mathbf{X}\mathbf{Y}\right) = \mathrm{tr}\left(\mathbf{Y}\mathbf{X}\right)$$
$$\mathrm{tr}\left(\mathbf{A}^T\left(\mathbf{B} \odot \mathbf{C}\right)\right) = \mathrm{tr}\left(\left(\mathbf{A} \odot \mathbf{B}\right)^T \mathbf{C}\right)$$
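These trace identities are easy to confirm on random matrices. The following sketch assumes NumPy; the matrix names `P`, `Q`, `S` and all shapes are arbitrary illustrative choices. For the Hadamard identity, `A`, `B`, and `C` must share one shape, and both sides equal the sum of $A_{ij}B_{ij}C_{ij}$ over all entries.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(3, 4))
P = rng.normal(size=(3, 4))
Q = rng.normal(size=(4, 3))      # P @ Q is 3x3 while Q @ P is 4x4
S = rng.normal(size=(3, 3))

# tr(X^T) = tr(X)
transpose_ok = np.isclose(np.trace(S.T), np.trace(S))

# tr(XY) = tr(YX), even though the two products have different sizes
cyclic_ok = np.isclose(np.trace(P @ Q), np.trace(Q @ P))

# tr(A^T (B ⊙ C)) = tr((A ⊙ B)^T C); NumPy's `*` is the elementwise product
lhs = np.trace(A.T @ (B * C))
rhs = np.trace((A * B).T @ C)
hadamard_ok = np.isclose(lhs, rhs)

# Both sides reduce to the elementwise triple sum
elementwise_sum = np.sum(A * B * C)

print(transpose_ok, cyclic_ok, hadamard_ok, np.isclose(lhs, elementwise_sum))
```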
Differentiation example:
Let $f = \mathrm{tr}\left(\mathbf{Y}^T\mathbf{M}\mathbf{Y}\right)$ with $\mathbf{Y} = \sigma\left(\mathbf{W}\mathbf{X}\right)$, where $\sigma$ acts elementwise. Find $\frac{\partial f}{\partial \mathbf{X}}$.
$$\mathrm{d}f = \mathrm{tr}\left((\mathrm{d}\mathbf{Y})^T\mathbf{M}\mathbf{Y} + \mathbf{Y}^T\mathbf{M}\,\mathrm{d}\mathbf{Y}\right) = \mathrm{tr}\left(\left(\mathbf{M}\mathbf{Y} + \mathbf{M}^T\mathbf{Y}\right)^T \mathrm{d}\mathbf{Y}\right) \Rightarrow \frac{\partial f}{\partial \mathbf{Y}} = \mathbf{M}\mathbf{Y} + \mathbf{M}^T\mathbf{Y}$$
$$\mathrm{d}\mathbf{Y} = \sigma'\left(\mathbf{W}\mathbf{X}\right) \odot \left(\mathbf{W}\,\mathrm{d}\mathbf{X}\right)$$
$$\mathrm{d}f = \mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf{Y}}\right)^T \mathrm{d}\mathbf{Y}\right) = \mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf{Y}}\right)^T \left(\sigma'\left(\mathbf{W}\mathbf{X}\right) \odot \left(\mathbf{W}\,\mathrm{d}\mathbf{X}\right)\right)\right) = \mathrm{tr}\left(\left(\frac{\partial f}{\partial \mathbf{Y}} \odot \sigma'\left(\mathbf{W}\mathbf{X}\right)\right)^T \mathbf{W}\,\mathrm{d}\mathbf{X}\right) = \mathrm{tr}\left(\left(\mathbf{W}^T\left(\frac{\partial f}{\partial \mathbf{Y}} \odot \sigma'\left(\mathbf{W}\mathbf{X}\right)\right)\right)^T \mathrm{d}\mathbf{X}\right)$$
Therefore
$$\frac{\partial f}{\partial \mathbf{X}} = \mathbf{W}^T\left(\left(\mathbf{M}\mathbf{Y} + \mathbf{M}^T\mathbf{Y}\right) \odot \sigma'\left(\mathbf{W}\mathbf{X}\right)\right)$$
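The final expression can be checked entry by entry with central differences. This is a sketch assuming NumPy, with $\sigma = \tanh$ picked purely as an illustrative elementwise function (the derivation does not fix a particular $\sigma$); the shapes and random data are likewise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.normal(size=(4, 3))
X = rng.normal(size=(3, 5))
M = rng.normal(size=(4, 4))      # not symmetric, so M Y and M^T Y differ

# Assumption: sigma = tanh, applied elementwise
sigma = np.tanh

def sigma_prime(Z):
    return 1.0 - np.tanh(Z) ** 2

def f(X_mat):
    """f = tr(Y^T M Y) with Y = sigma(W X)"""
    Y_mat = sigma(W @ X_mat)
    return np.trace(Y_mat.T @ M @ Y_mat)

# Closed-form gradient from the derivation (denominator layout, same shape as X)
Y = sigma(W @ X)
analytic = W.T @ ((M @ Y + M.T @ Y) * sigma_prime(W @ X))

# Central differences, perturbing one entry of X at a time
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```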
References:
https://zhuanlan.zhihu.com/p/24709748
https://en.wikipedia.org/wiki/Matrix_calculus#convert_differential_derivative