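This series is based mainly on Chapter 3, "Matrix Differentials", of Zhang Xianda's "Matrix Analysis and Applications" (《矩阵分析与应用》) and on the blog posts listed below, organized here as study notes. The reading order follows SinclairWang's article "矩阵求导入门学习路线" (an introductory learning path for matrix derivatives); working through the posts in the order given below is recommended.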
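Description | Link |
---|---|
Matrix derivatives, fundamentals: the essence of numerator layout and denominator layout | https://zhuanlan.zhihu.com/p/263777564 |
Matrix derivatives, basics: derivatives of real-valued scalar functions of vector/matrix variables | https://zhuanlan.zhihu.com/p/273729929 |
Matrix derivatives, advanced: the matrix trace $\text{tr}(\pmb{A})$ and the first-order real matrix differential $d\pmb{X}$ | https://zhuanlan.zhihu.com/p/288541909 |
Matrix derivative techniques (part 1): derivative of a scalar with respect to a matrix | https://zhuanlan.zhihu.com/p/24709748 |
Matrix derivative techniques (part 2): derivative of a matrix with respect to a matrix | https://zhuanlan.zhihu.com/p/24863977 |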
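The matrix differential, also called the matrix derivative, appears frequently in derivations in machine learning, image processing, optimization, and related fields. This post gives a brief introduction to matrix derivatives in their various forms.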
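This post uses a lowercase letter $x$ for a scalar, a bold lowercase letter $\pmb{x}$ for a (column) vector, and a bold uppercase letter $\pmb{X}$ for a matrix.
This post mainly discusses the partial derivatives of real-valued scalar functions, real-valued vector functions, and real-valued matrix functions with respect to a real vector variable or a real matrix variable. To make this easier to follow, we first fix the notation for variables and functions:
$\pmb{x}=[x_1,x_2,\cdots,x_m]^T \in \mathbb{R}^m$ is a real vector variable; $\pmb{X}=[\pmb{x}_1,\pmb{x}_2,\cdots,\pmb{x}_n] \in \mathbb{R}^{m \times n}$ is a real matrix variable with columns $\pmb{x}_j \in \mathbb{R}^m$.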
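A real-valued function is a function $f: X \to Y$ whose codomain $Y$ is the set of real numbers $\mathbb{R}$; the domain $X$ may be a subset of the complex field. "Real-valued" means that every function value is a real number: it cannot be imaginary or $\pm\infty$. Real-valued functions are classified as follows: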
Function type \ Variable type | Scalar variable $x \in \mathbb{R}$ | Vector variable $\boldsymbol{x} \in \mathbb{R}^m$ | Matrix variable $\boldsymbol{X} \in \mathbb{R}^{m \times n}$ |
---|---|---|---|
Scalar function $f \in \mathbb{R}$ | $f(x) \quad \{f: \mathbb{R} \to \mathbb{R}\}$ | $f(\boldsymbol{x}) \quad \{f: \mathbb{R}^m \to \mathbb{R}\}$ | $f(\boldsymbol{X}) \quad \{f: \mathbb{R}^{m \times n} \to \mathbb{R}\}$ |
Vector function $\boldsymbol{f} \in \mathbb{R}^p$ | $\boldsymbol{f}(x) \quad \{\boldsymbol{f}: \mathbb{R} \to \mathbb{R}^p\}$ | $\boldsymbol{f}(\boldsymbol{x}) \quad \{\boldsymbol{f}: \mathbb{R}^m \to \mathbb{R}^p\}$ | $\boldsymbol{f}(\boldsymbol{X}) \quad \{\boldsymbol{f}: \mathbb{R}^{m \times n} \to \mathbb{R}^{p}\}$ |
Matrix function $\boldsymbol{F} \in \mathbb{R}^{p \times q}$ | $\boldsymbol{F}(x) \quad \{\boldsymbol{F}: \mathbb{R} \to \mathbb{R}^{p \times q}\}$ | $\boldsymbol{F}(\boldsymbol{x}) \quad \{\boldsymbol{F}: \mathbb{R}^m \to \mathbb{R}^{p \times q}\}$ | $\boldsymbol{F}(\boldsymbol{X}) \quad \{\boldsymbol{F}: \mathbb{R}^{m \times n} \to \mathbb{R}^{p \times q}\}$ |
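The matrix differential generalizes the differential of a real function to matrix functions. Matrix derivatives provide a convenient way to compute the gradient matrices and Hessian matrices of scalar functions, vector functions, and matrix functions.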
In calculus we learned how to take partial derivatives of a multivariable function. For example, for $f(x_1, x_2, x_3) = x_1^2+x_2^2+x_3^2+x_1x_2x_3$, taking the partial derivative of $f$ with respect to $x_1, x_2, x_3$ in turn gives
$$
\begin{cases}
\frac{\partial f}{\partial x_1} = 2x_1+ x_2x_3 \\
\frac{\partial f}{\partial x_2} = 2x_2+ x_1x_3 \\
\frac{\partial f}{\partial x_3} = 2x_3+ x_1x_2
\end{cases}
\tag{2-1}
$$
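Writing this set of scalars as a vector gives the derivative of the scalar $f$ with respect to the $3$-dimensional vector $\boldsymbol{x}$, and the result is again a $3$-dimensional vector: $\frac{\partial f}{\partial \boldsymbol{x}}$. Matrix derivatives work the same way: in essence, each $f$ in $\boldsymbol{F}$ is differentiated with respect to each element of the variable, one at a time, and the results are simply written in vector or matrix form. So the derivative of a scalar with respect to a vector is nothing more than the derivative of the scalar with respect to each component of the vector, with the results arranged as a vector. The same holds for the derivative of a vector with respect to a scalar, a vector with respect to a vector, a vector with respect to a matrix, a matrix with respect to a vector, and a matrix with respect to a matrix.
In essence, then, a matrix derivative differentiates each element of the dependent variable with respect to each element of the independent variable and arranges the results as a matrix. However, element-wise differentiation destroys the overall structure, and matrix operations are tidier. When differentiating, it is therefore better not to take the matrices apart but to look for a method that treats them as a whole.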
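The element-by-element view can be checked numerically. The sketch below is only an illustration: it assumes NumPy, and the helper names `analytic_grad` and `numeric_grad` are made up here. It stacks the three partial derivatives of Eq. (2-1) into one vector and compares them with a central-difference estimate that perturbs one component of $\boldsymbol{x}$ at a time.

```python
import numpy as np

def f(x):
    """f(x1, x2, x3) = x1^2 + x2^2 + x3^2 + x1*x2*x3."""
    return float(np.sum(x**2) + np.prod(x))

def analytic_grad(x):
    """The partial derivatives of Eq. (2-1), stacked into one vector."""
    x1, x2, x3 = x
    return np.array([2*x1 + x2*x3, 2*x2 + x1*x3, 2*x3 + x1*x2])

def numeric_grad(func, x, h=1e-6):
    """Central-difference estimate, perturbing one component at a time."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (func(x + e) - func(x - e)) / (2 * h)
    return g

x0 = np.array([1.0, 2.0, 3.0])
print(analytic_grad(x0))       # [8. 7. 8.]
print(numeric_grad(f, x0))     # close to the analytic values
```

The loop in `numeric_grad` is exactly the "differentiate with respect to each component, then arrange the results" recipe; the matrix-calculus rules exist so that this loop never has to be written out by hand.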
Depending on whether the independent variable and the dependent variable are each a scalar, a vector, or a matrix, there are 9 possible definitions of a matrix derivative:
Dependent variable \ Independent variable | Scalar $x$ | Vector $\pmb{x}$ | Matrix $\pmb{X}$ |
---|---|---|---|
Scalar $f$ | $\dfrac{\partial f}{\partial x}$ | $\dfrac{\partial f}{\partial \boldsymbol{x}}$ | $\dfrac{\partial f}{\partial \boldsymbol{X}}$ |
Vector $\boldsymbol{f}$ | $\dfrac{\partial \boldsymbol{f}}{\partial x}$ | $\dfrac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}$ | $\dfrac{\partial \boldsymbol{f}}{\partial \boldsymbol{X}}$ |
Matrix $\boldsymbol{F}$ | $\dfrac{\partial \boldsymbol{F}}{\partial x}$ | $\dfrac{\partial \boldsymbol{F}}{\partial \boldsymbol{x}}$ | $\dfrac{\partial \boldsymbol{F}}{\partial \boldsymbol{X}}$ |
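Here we focus on the derivatives of real-valued scalar functions with respect to vector variables and matrix variables, and on the derivatives of matrix functions with respect to matrices.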
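For the multivariable example above, one may ask whether the result should be written as a row vector or as a column vector. The question of layout arises only when the numerator and denominator are both vectors, one a row vector and the other a column vector, or when one of them is a scalar and the other is a row or column vector. There are two basic layouts: numerator layout and denominator layout.
In numerator layout, the dimensions of the result are governed by the numerator, that is, the result has the same dimensions as the numerator. In denominator layout, the dimensions of the result are governed by the denominator. For any one type of derivative, however, the two layouts cannot be used at the same time. The numerator-layout and denominator-layout results usually differ by a transpose. In practice a mixed layout is commonly used: when a vector or matrix is differentiated with respect to a scalar, numerator layout is used; when a scalar is differentiated with respect to a vector or matrix, denominator layout is used. The layouts of the specific matrix derivatives are analyzed below.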
Let $f(\pmb{x})$ with $\pmb{x}=[x_1,x_2,\cdots,x_m]^T$. The row-vector partial derivative operator is defined as
$$
D_{\boldsymbol{x}} \overset{def}{=} \dfrac{\partial}{\partial \boldsymbol{x}^T} = \left[ \frac{\partial}{\partial x_1}, \frac{\partial}{\partial x_2}, \cdots, \frac{\partial}{\partial x_m} \right]
\tag{2-2}
$$
The column-vector partial derivative operator, that is, the gradient operator, is defined as
$$
\nabla_{\boldsymbol{x}} \overset{def}{=} \dfrac{\partial}{\partial \boldsymbol{x}} = \left[ \frac{\partial}{\partial x_1}, \frac{\partial}{\partial x_2}, \cdots, \frac{\partial}{\partial x_m} \right]^T
\tag{2-3}
$$
1. Row-vector partial derivative form
$$
\text{D}_{\boldsymbol{x}}f(\pmb{x})= \frac{\partial f(\pmb{x})}{\partial \pmb{x}^T}= \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \cdots, \frac{\partial f}{\partial x_m} \right]
\tag{2-4}
$$
Eq. (2-4) is also called the row partial derivative vector form; for convenience it is called the row-vector partial derivative form from here on.
2. Gradient vector form
$$
\nabla_{\boldsymbol{x}}f(\pmb{x})= \frac{\partial f(\pmb{x})}{\partial \pmb{x}}= \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \cdots, \frac{\partial f}{\partial x_m} \right]^T
\tag{2-5}
$$
Eq. (2-5) is also called the column-vector partial derivative form or the column partial derivative vector form; for convenience it is called the gradient vector form from here on. Eqs. (2-4) and (2-5) are transposes of each other.
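As a small illustration of the two forms, the sketch below uses a made-up function $f(\pmb{x}) = \pmb{a}^T\pmb{x} + \pmb{x}^T\pmb{x}$, whose partial derivatives are $a_i + 2x_i$. The vector `a` and the helper names are assumptions for this example only, and NumPy is assumed. The same $m$ numbers are stored once as a $1 \times m$ row, as in Eq. (2-4), and once as an $m \times 1$ column, as in Eq. (2-5).

```python
import numpy as np

a = np.array([1.0, -2.0, 0.5])   # hypothetical coefficient vector

def deriv_row(x):
    """Row-vector partial derivative form (2-4): shape 1 x m."""
    return (a + 2.0 * x).reshape(1, -1)

def grad_column(x):
    """Gradient vector form (2-5): shape m x 1."""
    return (a + 2.0 * x).reshape(-1, 1)

x0 = np.array([1.0, 2.0, 3.0])
print(deriv_row(x0).shape)                                 # (1, 3)
print(grad_column(x0).shape)                               # (3, 1)
print(np.array_equal(deriv_row(x0).T, grad_column(x0)))    # True
```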
Let $\pmb{f}(\pmb{x}) = [f_1,f_2,\cdots,f_n]^T$ with $\pmb{x}=[x_1,x_2,\cdots,x_m]^T$.
Numerator layout (Jacobian matrix form) takes the numerator as a column vector and the denominator as a row vector:
$$
\frac{\partial \pmb{f}_{n\times1}(\pmb{x})}{\partial \pmb{x}^T_{m\times1}}=
\begin{bmatrix}
\frac{\partial f_1}{\partial x_{1}} & \frac{\partial f_1}{\partial x_{2}} & \cdots & \frac{\partial f_1}{\partial x_{m}} \\
\frac{\partial f_2}{\partial x_{1}} & \frac{\partial f_2}{\partial x_{2}} & \cdots & \frac{\partial f_2}{\partial x_{m}} \\
\vdots & \vdots & \vdots & \vdots \\
\frac{\partial f_n}{\partial x_{1}} & \frac{\partial f_n}{\partial x_{2}} & \cdots & \frac{\partial f_n}{\partial x_{m}}
\end{bmatrix}_{n \times m}
\tag{2-6}
$$
Denominator layout (gradient matrix form) takes the denominator as a column vector and the numerator as a row vector:
$$
\frac{\partial \pmb{f}_{n\times1}^T(\pmb{x})}{\partial \pmb{x}_{m\times1}}=
\begin{bmatrix}
\frac{\partial f_1}{\partial x_{1}} & \frac{\partial f_2}{\partial x_{1}} & \cdots & \frac{\partial f_n}{\partial x_{1}} \\
\frac{\partial f_1}{\partial x_{2}} & \frac{\partial f_2}{\partial x_{2}} & \cdots & \frac{\partial f_n}{\partial x_{2}} \\
\vdots & \vdots & \vdots & \vdots \\
\frac{\partial f_1}{\partial x_{m}} & \frac{\partial f_2}{\partial x_{m}} & \cdots & \frac{\partial f_n}{\partial x_{m}}
\end{bmatrix}_{m \times n}
\tag{2-7}
$$
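To make the two layouts concrete, here is a rough numerical sketch. It assumes NumPy; the function `f` and the helper `jacobian_numerator` are invented for illustration. It builds the $n \times m$ Jacobian of Eq. (2-6) column by column with central differences, then obtains the $m \times n$ gradient matrix of Eq. (2-7) as its transpose.

```python
import numpy as np

def f(x):
    """Hypothetical vector function f: R^3 -> R^2."""
    x1, x2, x3 = x
    return np.array([x1 * x2, x2 + x3**2])

def jacobian_numerator(func, x, h=1e-6):
    """Numerator layout (2-6): entry (i, j) is d f_i / d x_j, shape n x m."""
    n = func(x).size
    J = np.zeros((n, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = h
        # Perturbing x_j fills column j with the derivatives of every f_i.
        J[:, j] = (func(x + e) - func(x - e)) / (2 * h)
    return J

x0 = np.array([1.0, 2.0, 3.0])
J = jacobian_numerator(f, x0)   # numerator layout, n x m
G = J.T                         # denominator layout (2-7), m x n
print(J.shape, G.shape)         # (2, 3) (3, 2)
```

For this `f`, the exact Jacobian at `x0` is `[[2, 1, 0], [0, 1, 6]]`, which the finite-difference result should match to within rounding error.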
Let $f(\pmb{X})$ with $\pmb{X}_{m\times n}=(x_{ij})_{i=1,j=1}^{m,n}$.
We introduce a new concept here. The vectorization $\text{vec}(\pmb{X})$ of a matrix $\pmb{X} \in \mathbb{R}^{m \times n}$ is a linear transformation that stacks the columns of $\pmb{X}$ into a vector. In other words, $\text{vec}(\pmb{X})$ takes column $1$, column $2$, and so on up to column $n$ of $\pmb{X}$ and stacks them in order into an $mn \times 1$ column vector:
$$
\text{vec}(\pmb{X})= \left[ x_{11},x_{21},\cdots,x_{m1},x_{12},x_{22},\cdots,x_{m2},\cdots,x_{1n},x_{2n},\cdots,x_{mn} \right]^T
\tag{2-8}
$$
A matrix $\pmb{X}$ can also be vectorized by stacking its rows. This is called the row vectorization of the matrix, written $\text{rvec}(\pmb{X})$ and defined as
$$
\text{rvec}(\pmb{X})= \left[ x_{11},x_{12},\cdots,x_{1n},x_{21},x_{22},\cdots,x_{2n},\cdots,x_{m1},x_{m2},\cdots,x_{mn} \right]
$$
The vectorization and the row vectorization of a matrix are related by
$$
\text{rvec}(\pmb{X}) = (\text{vec}(\pmb{X}^T))^T,\quad \text{vec}(\pmb{X}^T) = (\text{rvec}(\pmb{X}))^T
$$
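In NumPy terms, column stacking corresponds to reshaping in Fortran (column-major) order and row stacking to reshaping in C (row-major) order. The helpers `vec` and `rvec` below are illustrative names, not library functions; the sketch also checks the two identities relating them.

```python
import numpy as np

def vec(X):
    """Column-stacking vectorization (2-8): an mn x 1 column vector."""
    return X.reshape(-1, 1, order="F")

def rvec(X):
    """Row-stacking vectorization: a 1 x mn row vector."""
    return X.reshape(1, -1, order="C")

X = np.array([[1, 2, 3],
              [4, 5, 6]])

print(vec(X).ravel())    # [1 4 2 5 3 6]
print(rvec(X).ravel())   # [1 2 3 4 5 6]

# rvec(X) = vec(X^T)^T  and  vec(X^T) = rvec(X)^T
print(np.array_equal(rvec(X), vec(X.T).T))   # True
print(np.array_equal(vec(X.T), rvec(X).T))   # True
```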
1. Row-vector partial derivative form
First vectorize the matrix variable $\pmb{X}$ with $\text{vec}$ to turn it into a vector variable, then apply Eq. (2-2) to that vector variable:
$$
\begin{aligned}
\text{D}_{\text{vec}\boldsymbol{X}}f(\pmb{X}) &= \frac{\partial f(\pmb{X})}{\partial \text{vec}^T(\pmb{X})} \\
&= \left[ \frac{\partial f}{\partial x_{11}},\frac{\partial f}{\partial x_{21}},\cdots,\frac{\partial f}{\partial x_{m1}},\frac{\partial f}{\partial x_{12}},\frac{\partial f}{\partial x_{22}},\cdots,\frac{\partial f}{\partial x_{m2}},\cdots,\frac{\partial f}{\partial x_{1n}},\frac{\partial f}{\partial x_{2n}},\cdots,\frac{\partial f}{\partial x_{mn}} \right]
\end{aligned}
\tag{2-9}
$$
2. Jacobian matrix form
First transpose the matrix variable $\boldsymbol{X}$, then take the partial derivative with respect to the element at each position of the transpose; the result has the same layout as the transpose.
$$
\text{D}_{\boldsymbol{X}}f(\pmb{X}) = \frac{\partial f(\pmb{X})}{\partial \pmb{X}^T_{m\times n}} =
\begin{bmatrix}
\frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{21}} & \cdots & \frac{\partial f}{\partial x_{m1}} \\
\frac{\partial f}{\partial x_{12}} & \frac{\partial f}{\partial x_{22}} & \cdots & \frac{\partial f}{\partial x_{m2}} \\
\vdots & \vdots & \vdots & \vdots \\
\frac{\partial f}{\partial x_{1n}} & \frac{\partial f}{\partial x_{2n}} & \cdots & \frac{\partial f}{\partial x_{mn}}
\end{bmatrix}_{n\times m}
\tag{2-10}
$$
3. Gradient vector form
First vectorize the matrix variable $\boldsymbol{X}$ with $\text{vec}$ to turn it into a vector variable, then apply Eq. (2-5) to that variable:
$$
\begin{aligned}
\nabla_{\text{vec}\boldsymbol{X}}f(\pmb{X}) &= \frac{\partial f(\pmb{X})}{\partial \text{vec}\pmb{X}} \\
&= \left[ \frac{\partial f}{\partial x_{11}},\frac{\partial f}{\partial x_{21}},\cdots,\frac{\partial f}{\partial x_{m1}},\frac{\partial f}{\partial x_{12}},\frac{\partial f}{\partial x_{22}},\cdots,\frac{\partial f}{\partial x_{m2}},\cdots,\frac{\partial f}{\partial x_{1n}},\frac{\partial f}{\partial x_{2n}},\cdots,\frac{\partial f}{\partial x_{mn}} \right]^T
\end{aligned}
\tag{2-11}
$$
4. Gradient matrix form
Take the partial derivative directly with respect to the element at each position of the original matrix variable $\boldsymbol{X}$; the result has the same layout as the original matrix.
$$
\nabla_{\boldsymbol{X}}f(\pmb{X}) = \frac{\partial f(\pmb{X})}{\partial \pmb{X}_{m\times n}} =
\begin{bmatrix}
\frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{12}} & \cdots & \frac{\partial f}{\partial x_{1n}} \\
\frac{\partial f}{\partial x_{21}} & \frac{\partial f}{\partial x_{22}} & \cdots & \frac{\partial f}{\partial x_{2n}} \\
\vdots & \vdots & \vdots & \vdots \\
\frac{\partial f}{\partial x_{m1}} & \frac{\partial f}{\partial x_{m2}} & \cdots & \frac{\partial f}{\partial x_{mn}}
\end{bmatrix}_{m\times n}
\tag{2-12}
$$
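The four forms hold the same $mn$ partial derivatives and differ only in arrangement. The sketch below is an illustration under assumptions: NumPy is used, and the test function $f(\pmb{X}) = \text{tr}(\pmb{A}\pmb{X})$ with a random constant matrix `A` is chosen here because its gradient matrix is known to be $\pmb{A}^T$. It computes the gradient matrix of Eq. (2-12) by finite differences and derives the other three forms from it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 3
A = rng.standard_normal((n, m))    # hypothetical constant matrix
X0 = rng.standard_normal((m, n))

def f(X):
    """Hypothetical scalar function f(X) = tr(A X)."""
    return float(np.trace(A @ X))

def gradient_matrix(func, X, h=1e-6):
    """Gradient matrix form (2-12): entry (i, j) is d f / d x_ij."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (func(X + E) - func(X - E)) / (2 * h)
    return G

G = gradient_matrix(f, X0)                # (2-12), m x n
J = G.T                                   # (2-10), n x m
grad_vec = G.reshape(-1, 1, order="F")    # (2-11), mn x 1
row_vec = grad_vec.T                      # (2-9),  1 x mn

print(G.shape, J.shape, grad_vec.shape, row_vec.shape)
print(np.allclose(G, A.T, atol=1e-6))     # gradient of tr(AX) is A^T
```

Note that the gradient vector is the column-stacked `vec` of the gradient matrix, which is why the reshape uses `order="F"`.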
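Summary: from the definitions above, when the matrix variable $\boldsymbol{X}$ is itself a column vector $\pmb{x}=[x_1,x_2,\cdots,x_m]^T$ (that is, $n=1$), Eqs. (2-4), (2-9), and (2-10) coincide, and Eqs. (2-5), (2-11), and (2-12) coincide. So for a real-valued scalar function $f(\pmb{x})$ of a vector variable $\pmb{x}=[x_1,x_2,\cdots,x_m]^T$, there are essentially two result layouts: the Jacobian matrix form (which has reduced to a row vector) and the gradient matrix form (which has reduced to a column vector). The two are transposes of each other. In one sentence: a real-valued scalar function of a vector variable is a special case of a real-valued scalar function of a matrix variable, with the variable changing from a matrix to a vector.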
Let $\pmb{F}(\pmb{X})$ with $\pmb{X}_{m\times n}=(x_{ij})_{i=1,j=1}^{m,n}$ and $\pmb{F}_{p\times q}=(f_{ij})_{i=1,j=1}^{p,q}$.
1. Jacobian matrix form
As before, first vectorize the matrix variable $\boldsymbol{X}$ with $\text{vec}$ to turn it into a vector variable:
$$
\text{vec}(\pmb{X})= \left[ x_{11},x_{21},\cdots,x_{m1},x_{12},x_{22},\cdots,x_{m2},\cdots,x_{1n},x_{2n},\cdots,x_{mn} \right]^T
\tag{2-13}
$$
Then vectorize the real matrix function $\boldsymbol{F}$ with $\text{vec}$ to turn it into a real vector function:
$$
\text{vec}(\pmb{F}(\pmb{X})) = \left[ f_{11}(\pmb{X}),f_{21}(\pmb{X}),\cdots,f_{p1}(\pmb{X}),f_{12}(\pmb{X}),f_{22}(\pmb{X}),\cdots,f_{p2}(\pmb{X}),\cdots,f_{1q}(\pmb{X}),f_{2q}(\pmb{X}),\cdots,f_{pq}(\pmb{X}) \right]^T
\tag{2-14}
$$
In this way the real matrix function $\pmb{F}(\pmb{X})$ of a matrix variable has been converted into a real vector function of a vector variable, namely $\text{vec}(\pmb{F}(\pmb{X}))$ as a function of $\text{vec}(\pmb{X})$. Applying the row-vector partial derivative operator of Eq. (2-2) to this vector function, as in the Jacobian layout of Eq. (2-6), gives a result laid out as a $pq \times mn$ matrix:
$$
\begin{aligned}
\text{D}_{\boldsymbol{X}}\pmb{F}(\pmb{X}) &= \frac{\partial \text{vec}_{pq\times 1}(\pmb{F}(\pmb{X}))}{\partial \text{vec}^T_{mn\times 1}\pmb{X}} \\
&=
\begin{bmatrix}
\frac{\partial f_{11}}{\partial x_{11}} & \frac{\partial f_{11}}{\partial x_{21}} & \cdots & \frac{\partial f_{11}}{\partial x_{m1}} & \frac{\partial f_{11}}{\partial x_{12}} & \frac{\partial f_{11}}{\partial x_{22}} & \cdots & \frac{\partial f_{11}}{\partial x_{m2}} & \cdots & \frac{\partial f_{11}}{\partial x_{1n}} & \frac{\partial f_{11}}{\partial x_{2n}} & \cdots & \frac{\partial f_{11}}{\partial x_{mn}} \\
\frac{\partial f_{21}}{\partial x_{11}} & \frac{\partial f_{21}}{\partial x_{21}} & \cdots & \frac{\partial f_{21}}{\partial x_{m1}} & \frac{\partial f_{21}}{\partial x_{12}} & \frac{\partial f_{21}}{\partial x_{22}} & \cdots & \frac{\partial f_{21}}{\partial x_{m2}} & \cdots & \frac{\partial f_{21}}{\partial x_{1n}} & \frac{\partial f_{21}}{\partial x_{2n}} & \cdots & \frac{\partial f_{21}}{\partial x_{mn}} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\frac{\partial f_{p1}}{\partial x_{11}} & \frac{\partial f_{p1}}{\partial x_{21}} & \cdots & \frac{\partial f_{p1}}{\partial x_{m1}} & \frac{\partial f_{p1}}{\partial x_{12}} & \frac{\partial f_{p1}}{\partial x_{22}} & \cdots & \frac{\partial f_{p1}}{\partial x_{m2}} & \cdots & \frac{\partial f_{p1}}{\partial x_{1n}} & \frac{\partial f_{p1}}{\partial x_{2n}} & \cdots & \frac{\partial f_{p1}}{\partial x_{mn}} \\
\frac{\partial f_{12}}{\partial x_{11}} & \frac{\partial f_{12}}{\partial x_{21}} & \cdots & \frac{\partial f_{12}}{\partial x_{m1}} & \frac{\partial f_{12}}{\partial x_{12}} & \frac{\partial f_{12}}{\partial x_{22}} & \cdots & \frac{\partial f_{12}}{\partial x_{m2}} & \cdots & \frac{\partial f_{12}}{\partial x_{1n}} & \frac{\partial f_{12}}{\partial x_{2n}} & \cdots & \frac{\partial f_{12}}{\partial x_{mn}} \\
\frac{\partial f_{22}}{\partial x_{11}} & \frac{\partial f_{22}}{\partial x_{21}} & \cdots & \frac{\partial f_{22}}{\partial x_{m1}} & \frac{\partial f_{22}}{\partial x_{12}} & \frac{\partial f_{22}}{\partial x_{22}} & \cdots & \frac{\partial f_{22}}{\partial x_{m2}} & \cdots & \frac{\partial f_{22}}{\partial x_{1n}} & \frac{\partial f_{22}}{\partial x_{2n}} & \cdots & \frac{\partial f_{22}}{\partial x_{mn}} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\frac{\partial f_{p2}}{\partial x_{11}} & \frac{\partial f_{p2}}{\partial x_{21}} & \cdots & \frac{\partial f_{p2}}{\partial x_{m1}} & \frac{\partial f_{p2}}{\partial x_{12}} & \frac{\partial f_{p2}}{\partial x_{22}} & \cdots & \frac{\partial f_{p2}}{\partial x_{m2}} & \cdots & \frac{\partial f_{p2}}{\partial x_{1n}} & \frac{\partial f_{p2}}{\partial x_{2n}} & \cdots & \frac{\partial f_{p2}}{\partial x_{mn}} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\frac{\partial f_{1q}}{\partial x_{11}} & \frac{\partial f_{1q}}{\partial x_{21}} & \cdots & \frac{\partial f_{1q}}{\partial x_{m1}} & \frac{\partial f_{1q}}{\partial x_{12}} & \frac{\partial f_{1q}}{\partial x_{22}} & \cdots & \frac{\partial f_{1q}}{\partial x_{m2}} & \cdots & \frac{\partial f_{1q}}{\partial x_{1n}} & \frac{\partial f_{1q}}{\partial x_{2n}} & \cdots & \frac{\partial f_{1q}}{\partial x_{mn}} \\
\frac{\partial f_{2q}}{\partial x_{11}} & \frac{\partial f_{2q}}{\partial x_{21}} & \cdots & \frac{\partial f_{2q}}{\partial x_{m1}} & \frac{\partial f_{2q}}{\partial x_{12}} & \frac{\partial f_{2q}}{\partial x_{22}} & \cdots & \frac{\partial f_{2q}}{\partial x_{m2}} & \cdots & \frac{\partial f_{2q}}{\partial x_{1n}} & \frac{\partial f_{2q}}{\partial x_{2n}} & \cdots & \frac{\partial f_{2q}}{\partial x_{mn}} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
\frac{\partial f_{pq}}{\partial x_{11}} & \frac{\partial f_{pq}}{\partial x_{21}} & \cdots & \frac{\partial f_{pq}}{\partial x_{m1}} & \frac{\partial f_{pq}}{\partial x_{12}} & \frac{\partial f_{pq}}{\partial x_{22}} & \cdots & \frac{\partial f_{pq}}{\partial x_{m2}} & \cdots & \frac{\partial f_{pq}}{\partial x_{1n}} & \frac{\partial f_{pq}}{\partial x_{2n}} & \cdots & \frac{\partial f_{pq}}{\partial x_{mn}}
\end{bmatrix}_{pq\times mn}
\end{aligned}
\tag{2-15}
$$
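The $pq \times mn$ layout can be sanity-checked on a linear example. The sketch below is an assumption-laden illustration: it uses NumPy, a made-up function $\pmb{F}(\pmb{X}) = \pmb{A}\pmb{X}\pmb{B}$ with random constant matrices, and the hypothetical helper name `jacobian_matrix_form`. Because $\text{vec}(\pmb{A}\pmb{X}\pmb{B}) = (\pmb{B}^T \otimes \pmb{A})\,\text{vec}(\pmb{X})$, the Jacobian of Eq. (2-15) for this function should equal the Kronecker product $\pmb{B}^T \otimes \pmb{A}$.

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization, returned as a 1-D array."""
    return M.reshape(-1, order="F")

def jacobian_matrix_form(F, X, h=1e-6):
    """(2-15): rows follow vec(F), columns follow vec(X); shape pq x mn."""
    m, n = X.shape
    pq = F(X).size
    J = np.zeros((pq, m * n))
    for k in range(m * n):
        e = np.zeros(m * n)
        e[k] = h
        E = e.reshape(m, n, order="F")   # perturb the k-th entry of vec(X)
        J[:, k] = (vec(F(X + E)) - vec(F(X - E))) / (2 * h)
    return J

rng = np.random.default_rng(0)
p, m, n, q = 2, 3, 2, 4
A = rng.standard_normal((p, m))    # hypothetical constant matrices
B = rng.standard_normal((n, q))
X0 = rng.standard_normal((m, n))

J = jacobian_matrix_form(lambda X: A @ X @ B, X0)
print(J.shape)                                        # (8, 6)
print(np.allclose(J, np.kron(B.T, A), atol=1e-6))     # True
```

Using column-major order for both `vec(F)` and `vec(X)` is what makes the row and column ordering match Eq. (2-15); a row-major flatten would permute the rows and columns.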
2. Gradient matrix form
As before, first vectorize the matrix variable $\boldsymbol{X}$ with $\text{vec}$ to turn it into a vector variable:
$$
\text{vec}(\pmb{X})= \left[ x_{11},x_{21},\cdots,x_{m1},x_{12},x_{22},\cdots,x_{m2},\cdots,x_{1n},x_{2n},\cdots,x_{mn} \right]^T
\tag{2-16}
$$
Then vectorize the real matrix function $\boldsymbol{F}$ with $\text{vec}$ to turn it into a real vector function: