Gradient of a Scalar Function with Respect to a Matrix
Definitions
The derivative of a real-valued function $f(\mathbf{X})$ with respect to a matrix argument $\mathbf{X}\in\mathbb{R}^{m\times n}$ can be defined in two ways:
- Jacobian matrix:
$$D_{\mathbf{X}}f(\mathbf{X})=\frac{\partial f(\mathbf{X})}{\partial\mathbf{X}^T}=\left(\frac{\partial f(\mathbf{X})}{\partial x_{ji}}\right)_{n\times m}$$
In practice, when differentiating with respect to a matrix it is usually more convenient to work with the gradient matrix, i.e. the transpose of the Jacobian matrix:
$$\nabla_{\mathbf{X}}f(\mathbf{X})=\frac{\partial f(\mathbf{X})}{\partial\mathbf{X}}=\left(\frac{\partial f(\mathbf{X})}{\partial x_{ij}}\right)_{m\times n}$$
- Row-vector partial derivative:
$$\mathrm{vec}\,\mathbf{X}=[x_{11},x_{21},\dots,x_{m1},\dots,x_{1n},\dots,x_{mn}]^T$$
$$D_{\mathrm{vec}\,\mathbf{X}}f(\mathbf{X})=\frac{\partial f(\mathbf{X})}{\partial (\mathrm{vec}\,\mathbf{X})^T}=\left[ \frac{\partial f(\mathbf{X})}{\partial x_{11}}, \dots, \frac{\partial f(\mathbf{X})}{\partial x_{m1}},\dots,\frac{\partial f(\mathbf{X})}{\partial x_{1n}},\dots,\frac{\partial f(\mathbf{X})}{\partial x_{mn}} \right]$$
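Laying the rows of the Jacobian matrix side by side gives exactly the row-vector partial derivative. Differentiating a real-valued function with respect to an $m\times n$ matrix is therefore no different from taking the gradient with respect to the corresponding $mn$-dimensional vector. By analogy with the ordinary gradient, setting $\nabla_{\mathbf{X}}f(\mathbf{X})=\mathbf{0}$ gives a necessary condition for $f(\mathbf{X})$ to attain an extremum (excluding points where $f$ is not differentiable, such as discontinuities).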
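As an aside, the numerator layout and denominator layout described in Wikipedia's Matrix calculus article correspond to the Jacobian matrix and the gradient matrix defined here, respectively.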
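The relationship between the three objects can be checked numerically. The sketch below is only an illustration: the helper `numerical_gradient`, the test function `f` (sum of squared entries, whose gradient is $2\mathbf{X}$), and the sizes are my own choices, not part of the definitions above.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference estimate of the m x n gradient matrix of a scalar f at X."""
    grad = np.zeros_like(X, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X, dtype=float)
            E[i, j] = eps                      # perturb only entry x_ij
            grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2))                # m = 3, n = 2
f = lambda M: np.sum(M ** 2)                   # example scalar function, gradient is 2X

gradient = numerical_gradient(f, X)            # gradient matrix, shape (m, n)
jacobian = gradient.T                          # Jacobian matrix, shape (n, m)

# vec stacks columns, which is column-major ("F") order in NumPy
row_partial = gradient.flatten(order="F")[None, :]   # shape (1, m*n)

# Laying the Jacobian's rows side by side gives the same row vector
jacobian_unrolled = jacobian.reshape(1, -1)
```

Here `jacobian_unrolled` and `row_partial` contain the same entries in the same order, which is the statement that the matrix derivative is just the gradient with respect to $\mathrm{vec}\,\mathbf{X}$ in a different shape.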
Methods for computing the gradient matrix
Comparing entries one by one
Ex 1: Let $\mathbf{A},\mathbf{B}\in\mathbb{R}^{n\times n}$. Show that $\frac{\partial\,\mathrm{tr}(\mathbf{AB})}{\partial\mathbf{A}}=\mathbf{B}^T$.
Pf: $\mathrm{tr}(\mathbf{AB})=\sum\limits_{i,j}a_{ij}b_{ji}$, so $\frac{\partial\,\mathrm{tr}(\mathbf{AB})}{\partial a_{ij}}=b_{ji}$, and then by definition $\frac{\partial\,\mathrm{tr}(\mathbf{AB})}{\partial\mathbf{A}}=\mathbf{B}^T$.
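A quick finite-difference check of Ex 1 might look like the following. It is a sketch only: the matrix size, random seed, and step size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

eps = 1e-6
grad = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = eps                          # perturb only entry a_ij
        # central difference of tr(AB) with respect to a_ij
        grad[i, j] = (np.trace((A + E) @ B) - np.trace((A - E) @ B)) / (2 * eps)

# largest entrywise deviation from the claimed gradient B^T
max_error = np.max(np.abs(grad - B.T))
```

Because $\mathrm{tr}(\mathbf{AB})$ is linear in $\mathbf{A}$, the central difference is exact up to floating-point rounding, so `max_error` should be tiny.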
Ex 2: Let $f(\mathbf{X})=\bm{a}^T\mathbf{X}\mathbf{X}^T\bm{b}$ with $\mathbf{X}\in\mathbb{R}^{m\times n}$ and $\bm{a},\bm{b}\in\mathbb{R}^m$. Find the gradient matrix with respect to $\mathbf{X}$.
Sol: $f(\mathbf{X})=\bm{a}^T\mathbf{X}\mathbf{X}^T\bm{b}=\sum\limits_{k=1}^m\sum\limits_{l=1}^m a_k\left(\sum\limits_{p=1}^n x_{kp}x_{lp}\right)b_l=\sum\limits_{k,l,p}a_k x_{kp}x_{lp}b_l$
By the product rule, and since $\frac{\partial x_{kp}}{\partial x_{ji}}$ equals 1 when $(k,p)=(j,i)$ and 0 otherwise,
$$\begin{aligned} \frac{\partial f(\mathbf{X})}{\partial x_{ji}} &= \sum\limits_{k,l,p}\left[a_kx_{lp}b_l\frac{\partial x_{kp}}{\partial x_{ji}}+a_kx_{kp}b_l\frac{\partial x_{lp}}{\partial x_{ji}}\right]\\ &=\sum\limits_{l}a_jx_{li}b_l+\sum\limits_{k}a_kx_{ki}b_j\\ &=[\mathbf{X}^T\bm{b}]_ia_j + [\mathbf{X}^T\bm{a}]_ib_j \end{aligned}$$
Hence $D_\mathbf{X}f(\mathbf{X})=\mathbf{X}^T(\bm{b}\bm{a}^T+\bm{a}\bm{b}^T)$ and $\nabla_{\mathbf{X}}f(\mathbf{X})=(\bm{a}\bm{b}^T+\bm{b}\bm{a}^T)\mathbf{X}$.
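The result of Ex 2 can be sanity-checked against finite differences. This is a minimal sketch; the helper `numerical_gradient` and the dimensions $m=3$, $n=2$ are assumptions made for illustration.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference estimate of the gradient matrix of a scalar f at X."""
    grad = np.zeros_like(X, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X, dtype=float)
            E[i, j] = eps
            grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
m, n = 3, 2
a = rng.standard_normal(m)
b = rng.standard_normal(m)
X = rng.standard_normal((m, n))

f = lambda M: a @ M @ M.T @ b                          # f(X) = a^T X X^T b
analytic = (np.outer(a, b) + np.outer(b, a)) @ X       # (a b^T + b a^T) X
numeric = numerical_gradient(f, X)

max_error = np.max(np.abs(numeric - analytic))
```

Since $f$ is quadratic in $\mathbf{X}$, the central difference has no truncation error, and `max_error` only reflects rounding.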
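Using the matrix differential and the trace
(Note: the properties and the theorem below are stated for matrices, but they apply equally to vectors.)
The matrix differential is denoted $d\mathbf{X}$ and defined entrywise as $d\mathbf{X}=[dx_{ij}]_{i=1,j=1}^{m,n}$; $d$ is a linear operator on $\mathbb{R}^{m\times n}$. Here are some of its properties, of which we will use the first four: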
- $d(\mathbf{X}^T) = (d\mathbf{X})^T$
- $d(\mathbf{A}\mathbf{X}\mathbf{B})=\mathbf{A}(d\mathbf{X})\mathbf{B}$ for constant matrices $\mathbf{A}$ and $\mathbf{B}$
- $d(\mathrm{tr}(\mathbf{X}))=\mathrm{tr}(d\mathbf{X})$
- $d(\mathrm{vec}(\mathbf{X}))=\mathrm{vec}(d\mathbf{X})$
- $d|\mathbf{X}|=|\mathbf{X}|\,\mathrm{tr}(\mathbf{X}^{-1}d\mathbf{X})$ for invertible $\mathbf{X}$
- $d(\mathbf{X}^{-1})=-\mathbf{X}^{-1}(d\mathbf{X})\mathbf{X}^{-1}$ (this identity once tripped me up in an ODE exercise, so I remember it well)
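The last two properties are the least obvious, so here is a rough numerical check of them to first order. The test matrix (a small random perturbation of $2\mathbf{I}$, chosen so that it is safely invertible) and the perturbation scale are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
X = 2.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))   # well-conditioned test matrix
dX = 1e-6 * rng.standard_normal((n, n))                    # small perturbation

X_inv = np.linalg.inv(X)

# d|X| = |X| tr(X^{-1} dX)
det_change = np.linalg.det(X + dX) - np.linalg.det(X)
det_predicted = np.linalg.det(X) * np.trace(X_inv @ dX)

# d(X^{-1}) = -X^{-1} (dX) X^{-1}
inv_change = np.linalg.inv(X + dX) - X_inv
inv_predicted = -X_inv @ dX @ X_inv

det_error = abs(det_change - det_predicted)
inv_error = np.max(np.abs(inv_change - inv_predicted))
```

Both errors are second order in the size of `dX`, so they should be many orders of magnitude smaller than the changes themselves.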
Next comes the most important theorem of this post:
Theorem: $df(\mathbf{X})=\mathrm{tr}(\mathbf{A}^Td\mathbf{X}) \Leftrightarrow \frac{\partial f(\mathbf{X})}{\partial \mathbf{X}}=\mathbf{A}$
Proof: Let $\bm{x}_j$ denote the $j$-th column of $\mathbf{X}$. Then
$$\begin{aligned} df(\mathbf{X}) &= \frac{\partial f(\mathbf{X})}{\partial \bm{x}_1^T}d\bm{x}_1+\dots+\frac{\partial f(\mathbf{X})}{\partial \bm{x}_n^T}d\bm{x}_n\\ &=\left[\frac{\partial f(\mathbf{X})}{\partial x_{11}},\dots,\frac{\partial f(\mathbf{X})}{\partial x_{m1}}\right]\begin{bmatrix} dx_{11}\\ \vdots \\ dx_{m1} \end{bmatrix}+\dots+\left[\frac{\partial f(\mathbf{X})}{\partial x_{1n}},\dots,\frac{\partial f(\mathbf{X})}{\partial x_{mn}}\right]\begin{bmatrix} dx_{1n}\\ \vdots \\ dx_{mn} \end{bmatrix}\\ &=\left[ \frac{\partial f(\mathbf{X})}{\partial x_{11}},\dots,\frac{\partial f(\mathbf{X})}{\partial x_{m1}},\dots,\frac{\partial f(\mathbf{X})}{\partial x_{1n}},\dots,\frac{\partial f(\mathbf{X})}{\partial x_{mn}} \right]\begin{bmatrix} dx_{11}\\ \vdots \\ dx_{m1}\\ \vdots \\ dx_{1n} \\ \vdots \\ dx_{mn} \end{bmatrix}\\ &=D_{\mathrm{vec}\,\mathbf{X}}f(\mathbf{X})\,d(\mathrm{vec}\,\mathbf{X}) \end{aligned}$$
Let $\mathbf{A}$ be the gradient matrix of $f(\mathbf{X})$, i.e.
$$\mathbf{A}=\frac{\partial f(\mathbf{X})}{\partial \mathbf{X}}$$
From the relationship between the gradient matrix and the row-vector partial derivative, $D_{\mathrm{vec}\,\mathbf{X}}f(\mathbf{X})=(\mathrm{vec}(\nabla_{\mathbf{X}}f(\mathbf{X})))^T$. Combining this with property 4 of the differential gives
$$df(\mathbf{X})=(\mathrm{vec}\,\mathbf{A})^T\,\mathrm{vec}(d\mathbf{X})$$
The relationship between the vectorization operator vec and the trace is easy to verify: $\mathrm{tr}(\mathbf{B}^T\mathbf{C})=\sum\limits_{i,j}b_{ij}c_{ij}=(\mathrm{vec}\,\mathbf{B})^T\,\mathrm{vec}\,\mathbf{C}$
Therefore $df(\mathbf{X})=\mathrm{tr}(\mathbf{A}^Td\mathbf{X})$.
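Both ingredients of the proof can be tried out numerically. In the sketch below, the `vec` helper and the test function $f(\mathbf{X})=\sum_{i,j}x_{ij}^2$ (with gradient $\mathbf{A}=2\mathbf{X}$) are illustrative assumptions, not something taken from the proof itself.

```python
import numpy as np

# vec stacks the columns of a matrix, i.e. column-major ("F") order
vec = lambda M: M.flatten(order="F")

rng = np.random.default_rng(0)
m, n = 3, 2

# Identity used in the proof: tr(B^T C) = vec(B)^T vec(C)
B = rng.standard_normal((m, n))
C = rng.standard_normal((m, n))
trace_form = np.trace(B.T @ C)
vec_form = vec(B) @ vec(C)

# The theorem itself: df is approximately tr(A^T dX) for a small dX
X = rng.standard_normal((m, n))
dX = 1e-6 * rng.standard_normal((m, n))
f = lambda M: np.sum(M ** 2)           # example function with known gradient 2X
A = 2 * X

df_actual = f(X + dX) - f(X)           # true change in f
df_linear = np.trace(A.T @ dX)         # first-order prediction from the theorem
```

The gap between `df_actual` and `df_linear` is the second-order term $\mathrm{tr}((d\mathbf{X})^Td\mathbf{X})$, which is negligible for a perturbation of this size.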
Ex 3: Find the gradient matrix of $\mathrm{tr}(\mathbf{X}^T\mathbf{A}\mathbf{X})$, where $\mathbf{A}$ is a constant matrix.
Sol:
$$\begin{aligned} d\,\mathrm{tr}(\mathbf{X}^T\mathbf{A}\mathbf{X}) &= \mathrm{tr}(d(\mathbf{X}^T\mathbf{A}\mathbf{X}))\\ &=\mathrm{tr}((d\mathbf{X})^T\mathbf{A}\mathbf{X}+\mathbf{X}^T\mathbf{A}\,d\mathbf{X})\\ &= \mathrm{tr}((d\mathbf{X})^T\mathbf{A}\mathbf{X})+\mathrm{tr}(\mathbf{X}^T\mathbf{A}\,d\mathbf{X})\\ &= \mathrm{tr}((\mathbf{A}\mathbf{X})^Td\mathbf{X})+\mathrm{tr}(\mathbf{X}^T\mathbf{A}\,d\mathbf{X})\\ &=\mathrm{tr}(\mathbf{X}^T(\mathbf{A}^T+\mathbf{A})\,d\mathbf{X}) \end{aligned}$$
Since $\mathbf{X}^T(\mathbf{A}^T+\mathbf{A})=((\mathbf{A}+\mathbf{A}^T)\mathbf{X})^T$, the theorem gives $\frac{\partial\,\mathrm{tr}(\mathbf{X}^T\mathbf{A}\mathbf{X})}{\partial \mathbf{X}}=(\mathbf{A}+\mathbf{A}^T)\mathbf{X}$.
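As with the earlier examples, the result of Ex 3 can be compared against finite differences. This is a sketch under assumed dimensions ($\mathbf{A}$ is $3\times 3$ and deliberately non-symmetric, $\mathbf{X}$ is $3\times 2$); `numerical_gradient` is a helper introduced only for this check.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference estimate of the gradient matrix of a scalar f at X."""
    grad = np.zeros_like(X, dtype=float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X, dtype=float)
            E[i, j] = eps
            grad[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
m, n = 3, 2
A = rng.standard_normal((m, m))                # generic, non-symmetric A
X = rng.standard_normal((m, n))

f = lambda M: np.trace(M.T @ A @ M)            # f(X) = tr(X^T A X)
analytic = (A + A.T) @ X                       # gradient from the theorem
numeric = numerical_gradient(f, X)

max_error = np.max(np.abs(numeric - analytic))
```

Using a non-symmetric `A` matters here: with a symmetric matrix the check could not distinguish $(\mathbf{A}+\mathbf{A}^T)\mathbf{X}$ from $2\mathbf{A}\mathbf{X}$.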
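This method is extremely practical, for two reasons:
- A scalar function $f(\mathbf{X})$ can always be written as a trace: $f(\mathbf{X})=\mathrm{tr}(f(\mathbf{X}))$
- The invariance of the trace under cyclic reordering, $\mathrm{tr}(\mathbf{BC})=\mathrm{tr}(\mathbf{CB})$, and under transposition, $\mathrm{tr}(\mathbf{B})=\mathrm{tr}(\mathbf{B}^T)$, lets the resulting differential be brought into the form $\mathrm{tr}(\mathbf{A}^Td\mathbf{X})$ required by the theorem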
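To see both points at work, the function from Ex 2, $f(\mathbf{X})=\bm{a}^T\mathbf{X}\mathbf{X}^T\bm{b}$, can be redone with the differential method. The steps below are my own sketch, assuming $\bm{a}$ and $\bm{b}$ are constant vectors; each scalar term is first wrapped in a trace, then rearranged by cyclic reordering and transposition:

$$\begin{aligned} df(\mathbf{X}) &= \bm{a}^T(d\mathbf{X})\mathbf{X}^T\bm{b}+\bm{a}^T\mathbf{X}(d\mathbf{X})^T\bm{b}\\ &=\mathrm{tr}(\bm{a}^T(d\mathbf{X})\mathbf{X}^T\bm{b})+\mathrm{tr}(\bm{a}^T\mathbf{X}(d\mathbf{X})^T\bm{b})\\ &=\mathrm{tr}(\mathbf{X}^T\bm{b}\bm{a}^Td\mathbf{X})+\mathrm{tr}(\bm{b}^T(d\mathbf{X})\mathbf{X}^T\bm{a})\\ &=\mathrm{tr}(\mathbf{X}^T\bm{b}\bm{a}^Td\mathbf{X})+\mathrm{tr}(\mathbf{X}^T\bm{a}\bm{b}^Td\mathbf{X})\\ &=\mathrm{tr}(\mathbf{X}^T(\bm{b}\bm{a}^T+\bm{a}\bm{b}^T)\,d\mathbf{X}) \end{aligned}$$

Reading off $\mathbf{A}^T=\mathbf{X}^T(\bm{b}\bm{a}^T+\bm{a}\bm{b}^T)$ gives $\nabla_{\mathbf{X}}f(\mathbf{X})=(\bm{a}\bm{b}^T+\bm{b}\bm{a}^T)\mathbf{X}$, the same answer as the entry-by-entry computation in Ex 2, with no index bookkeeping.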
For more examples, see the book 《矩阵分析与应用》 (Matrix Analysis and Applications, Tsinghua University Press), as well as Wikipedia's Matrix calculus article.