References
1. Matrix and Vector Derivatives in Machine Learning
The derivative has the same shape as the function (a matrix): each element of the result is the derivative of the corresponding matrix entry with respect to the scalar. If the matrix-valued function $\boldsymbol f$ is an $m\times n$ matrix, then its derivative with respect to a scalar $x$ is also an $m\times n$ matrix, with
$$\left(\frac{\partial\boldsymbol f}{\partial x}\right)_{ij}=\frac{\partial f_{ij}}{\partial x}$$
In particular, for an $n$-dimensional vector, the derivative with respect to a scalar is
$$\boldsymbol y=(y_1,\cdots,y_n) \implies \frac{\partial\boldsymbol y}{\partial x}=\left(\frac{\partial y_1}{\partial x},\cdots,\frac{\partial y_n}{\partial x}\right)$$
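The shape rule above can be checked numerically. The sketch below (an illustrative example, not from the original notes) compares the analytic derivative of a hand-picked vector-valued function of a scalar against a central finite difference:

```python
import numpy as np

def y(x):
    # Example vector-valued function of a scalar: y(x) = (x^2, sin x, e^x)
    return np.array([x**2, np.sin(x), np.exp(x)])

x0 = 0.7
# Analytic derivative dy/dx = (2x, cos x, e^x); same shape as y itself
analytic = np.array([2 * x0, np.cos(x0), np.exp(x0)])

# Central finite-difference approximation of dy/dx
h = 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-5))
```

The result is a vector of the same dimension as $\boldsymbol y$, as the convention requires.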
The derivative likewise has the same shape as the (matrix) variable: if the matrix $X$ is $m\times n$, then the derivative of a scalar function $f$ with respect to $X$ is also an $m\times n$ matrix, with
$$\left(\frac{\partial f}{\partial X}\right)_{ij}=\frac{\partial f}{\partial x_{ij}}$$
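As a sanity check of this entrywise definition, the sketch below (an illustrative example with an arbitrarily chosen function) verifies the standard identity $\partial\,\mathrm{tr}(XA)/\partial X=A^\top$ by differencing each entry $x_{ij}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
A = rng.standard_normal((4, 3))

def f(X):
    # Scalar function of a matrix: f(X) = tr(XA)
    return np.trace(X @ A)

# Analytic gradient: d tr(XA)/dX = A^T, same shape as X
analytic = A.T

# Entrywise central differences: (df/dX)_{ij} = df/dx_{ij}
numeric = np.zeros_like(X)
h = 1e-6
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = h
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-5))
```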
In particular, the derivative of a scalar function $f$ with respect to an $n$-dimensional vector $\boldsymbol x$ is
$$\nabla_{\boldsymbol x} f=\left(\frac{\partial f}{\partial x_1},\cdots,\frac{\partial f}{\partial x_n}\right)^\top$$
If the function value $\boldsymbol f$ is an $m$-dimensional vector and the variable $\boldsymbol x$ is an $n$-dimensional vector, then the derivative (the Jacobian) is an $m\times n$ matrix, with
$$\frac{\partial\boldsymbol f}{\partial\boldsymbol x}=\left(\frac{\partial\boldsymbol f}{\partial x_1},\cdots,\frac{\partial\boldsymbol f}{\partial x_n}\right),\quad \left(\frac{\partial\boldsymbol f}{\partial\boldsymbol x}\right)_{ij}=\frac{\partial f_i}{\partial x_j}$$
In particular, when the function value $f$ is a scalar, the Jacobian is a row vector. This differs from the column-vector convention for the derivative of a scalar with respect to a vector: the Jacobian is the transpose of the gradient, i.e.
$$\frac{\partial f}{\partial\boldsymbol x^\top}=(\nabla_{\boldsymbol x}f)^\top =\left(\frac{\partial f}{\partial x_1},\cdots,\frac{\partial f}{\partial x_n}\right)$$
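The Jacobian layout $(\partial\boldsymbol f/\partial\boldsymbol x)_{ij}=\partial f_i/\partial x_j$ can be verified directly. The sketch below (an illustrative example; `jacobian_fd` is a helper introduced here, not from the original notes) builds the finite-difference Jacobian column by column, one column per input coordinate $x_j$:

```python
import numpy as np

def f(x):
    # f: R^3 -> R^2, so the Jacobian is a 2x3 matrix
    return np.array([x[0] * x[1], np.sin(x[2])])

def jacobian_fd(f, x, h=1e-6):
    # Column j of the Jacobian is df/dx_j, approximated by central differences
    m, n = f(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x0 = np.array([1.0, 2.0, 0.5])
# Analytic Jacobian: row i holds the partials of f_i
analytic = np.array([[x0[1], x0[0], 0.0],
                     [0.0, 0.0, np.cos(x0[2])]])

print(np.allclose(jacobian_fd(f, x0), analytic, atol=1e-5))
```

For a scalar-valued `f` the same construction returns a `1 x n` row, matching the convention above.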
If all intermediate variables are vectors, say with the dependency chain $\boldsymbol x\to\boldsymbol v\to\boldsymbol u\to\boldsymbol f$, then
$$\frac{\partial\boldsymbol f}{\partial\boldsymbol x}=\frac{\partial\boldsymbol f}{\partial\boldsymbol u}\frac{\partial\boldsymbol u}{\partial\boldsymbol v}\frac{\partial\boldsymbol v}{\partial\boldsymbol x}$$
If the final value $f$ is a scalar, then
$$\frac{\partial f}{\partial\boldsymbol x^\top}=\frac{\partial f}{\partial\boldsymbol u^\top}\frac{\partial\boldsymbol u}{\partial\boldsymbol v}\frac{\partial\boldsymbol v}{\partial\boldsymbol x}$$
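The scalar-output chain rule can be checked on a small composition. The sketch below (an illustrative example with an arbitrarily chosen chain, not from the original notes) takes $\boldsymbol v=A\boldsymbol x$, $\boldsymbol u=\tanh(\boldsymbol v)$, $f=\sum_i u_i$, multiplies the three Jacobians, and compares against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
x0 = rng.standard_normal(4)

# Chain x -> v -> u -> f with v = Ax, u = tanh(v), f = sum(u)
def f(x):
    return np.tanh(A @ x).sum()

v = A @ x0
df_du = np.ones(3)                    # df/du^T: row of ones, since f = sum(u)
du_dv = np.diag(1 - np.tanh(v)**2)    # du/dv: diagonal of tanh'(v)
dv_dx = A                             # dv/dx: the Jacobian of a linear map is A

# Chain rule: df/dx^T = (df/du^T)(du/dv)(dv/dx)
analytic = df_du @ du_dv @ dv_dx

h = 1e-6
numeric = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h)
                    for e in np.eye(4)])

print(np.allclose(analytic, numeric, atol=1e-5))
```

This is exactly the multiplication pattern that backpropagation evaluates right to left.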
The results above can be used to derive backpropagation through time (BPTT) for RNNs.
Basic properties of the trace
In particular, the trace is invariant under cyclic permutation, $\mathrm{tr}(ABC)=\mathrm{tr}(BCA)=\mathrm{tr}(CAB)$, which is used repeatedly below.
Derivative of the trace (a scalar)
Claim: $\nabla_X\,\mathrm{tr}(XAX^\top B)=B^\top XA^\top+BXA$.
Proof: treat the two occurrences of $X$ as independent variables $X_1$ and $X_2$, differentiate with respect to each (using the cyclic property and the identities $\nabla_X\,\mathrm{tr}(CX)=C^\top$, $\nabla_X\,\mathrm{tr}(CX^\top)=C$), and then set $X_1=X_2=X$:
$$\begin{aligned} \nabla_X\,\mathrm{tr}(XAX^\top B) &=\nabla_{X_1}\mathrm{tr}(X_1AX_2^\top B)+\nabla_{X_2}\mathrm{tr}(X_1AX_2^\top B)\\ &=\nabla_{X_1}\mathrm{tr}(AX_2^\top BX_1)+\nabla_{X_2}\mathrm{tr}(BX_1AX_2^\top)\\ &=B^\top X_2A^\top+BX_1A\\ &=B^\top XA^\top+BXA \end{aligned}$$
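The identity just proved can be confirmed numerically. The sketch below (an illustrative check with random matrices) compares $B^\top XA^\top+BXA$ against entrywise finite differences of $\mathrm{tr}(XAX^\top B)$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 4))
A = rng.standard_normal((4, 4))
B = rng.standard_normal((3, 3))

def f(X):
    # Scalar function: tr(X A X^T B)
    return np.trace(X @ A @ X.T @ B)

# Claimed gradient from the proof above: B^T X A^T + B X A
analytic = B.T @ X @ A.T + B @ X @ A

# Entrywise central differences
numeric = np.zeros_like(X)
h = 1e-6
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = h
        numeric[i, j] = (f(X + E) - f(X - E)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-4))
```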
Trace-related derivatives
Derivatives of a scalar with respect to a vector or matrix can often be computed using these trace properties.