Define a vector operator $\nabla$ (read "nabla" or "del"):
$$
\nabla= \frac{\partial}{\partial x} \vec{e_x} + \frac{\partial}{\partial y} \vec{e_y} + \frac{\partial}{\partial z} \vec{e_z} \tag{1.1}
$$
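This operator is also called the Hamilton operator, where $\vec{e_x}$, $\vec{e_y}$, and $\vec{e_z}$ are the unit vectors along the $X$, $Y$, and $Z$ directions, respectively. In linear-algebra notation (with $T$ denoting the transpose) it is written as: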
$$
\nabla= \left[\frac{\partial}{\partial x}, \frac{\partial}{\partial y}, \frac{\partial}{\partial z}\right]^T \tag{1.2}
$$
$$
\begin{aligned}
\operatorname{div}(\operatorname{grad}(\psi)) & = \nabla \cdot (\nabla \psi) = \nabla^T (\nabla \psi) \\
&= \left[\frac{\partial}{\partial x}, \frac{\partial}{\partial y}, \frac{\partial}{\partial z}\right] \left[ \begin{matrix} \frac{\partial \psi}{\partial x} \\ \frac{\partial \psi}{\partial y} \\ \frac{\partial \psi}{\partial z} \end{matrix} \right] \\
&= \frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} + \frac{\partial^2 \psi}{\partial z^2}
\end{aligned} \tag{1.7}
$$
Setting expression (1.7) equal to zero gives the $\color{red}{\text{Laplace equation}}$:
$$
\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} + \frac{\partial^2 \psi}{\partial z^2} = 0 \tag{1.8}
$$
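As a quick illustration of (1.7) and (1.8), the following is a minimal sketch that composes the gradient and divergence symbolically. It assumes the SymPy library is available; the helper names `grad`, `div`, `laplacian` and the two test fields are hypothetical choices made only for this example.

```python
import sympy as sp

x, y, z = sp.symbols("x y z", real=True)
COORDS = (x, y, z)

def grad(psi):
    """Gradient of a scalar field as a 3x1 column vector, following (1.2)."""
    return sp.Matrix([sp.diff(psi, v) for v in COORDS])

def div(f):
    """Divergence of a 3x1 vector field: sum of d f_i / d x_i."""
    return sum(sp.diff(f[i], v) for i, v in enumerate(COORDS))

def laplacian(psi):
    """div(grad(psi)), i.e. the expression in (1.7)."""
    return sp.simplify(div(grad(psi)))

# Hypothetical test fields, chosen only for illustration.
harmonic = x**2 + y**2 - 2 * z**2   # second derivatives: 2 + 2 - 4
not_harmonic = x**2 + y**2 + z**2   # second derivatives: 2 + 2 + 2

print(laplacian(harmonic))      # 0, so this field satisfies (1.8)
print(laplacian(not_harmonic))  # 6, so this field does not
```

A field whose Laplacian vanishes everywhere, such as the first one here, is called harmonic.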
$$
\begin{aligned}
\operatorname{rot}(\operatorname{grad}(\psi)) & = \nabla\times \nabla\psi = \left| \begin{matrix} \vec{e_x} & \vec{e_y} & \vec{e_z} \\ \frac{\partial}{\partial x} & \frac{\partial}{\partial y} & \frac{\partial}{\partial z} \\ \frac{\partial \psi}{\partial x} & \frac{\partial \psi}{\partial y} & \frac{\partial \psi}{\partial z} \end{matrix} \right| \\
&=\vec{e_x} \left(\frac{\partial^2 \psi}{\partial y \partial z} - \frac{\partial^2 \psi}{\partial z \partial y}\right) - \vec{e_y} \left(\frac{\partial^2 \psi}{\partial x \partial z} - \frac{\partial^2 \psi}{\partial z \partial x}\right) + \vec{e_z} \left(\frac{\partial^2 \psi}{\partial x \partial y} - \frac{\partial^2 \psi}{\partial y \partial x}\right) \\
&= \boldsymbol{0}
\end{aligned} \tag{1.9}
$$
$$
\begin{aligned}
\operatorname{div}(\operatorname{rot}(\vec{f})) &= \nabla\cdot (\nabla\times \vec{f}) = \nabla^T (\nabla \times \vec{f}) \\
&= \left[\frac{\partial}{\partial x}, \frac{\partial}{\partial y}, \frac{\partial}{\partial z}\right] \left[ \begin{matrix} \frac{\partial f_z}{\partial y} - \frac{\partial f_y}{\partial z} \\ -\left(\frac{\partial f_z}{\partial x} - \frac{\partial f_x}{\partial z}\right)\\ \frac{\partial f_y}{\partial x} - \frac{\partial f_x}{\partial y} \end{matrix} \right] \\
&= \left(\frac{\partial^2 f_z}{\partial y \partial x} - \frac{\partial^2 f_y }{\partial z \partial x}\right) - \left(\frac{\partial^2 f_z}{\partial x \partial y} - \frac{\partial^2 f_x}{\partial z \partial y}\right) + \left(\frac{\partial^2 f_y}{\partial x \partial z} - \frac{\partial^2 f_x}{\partial y \partial z}\right) \\
&= 0
\end{aligned} \tag{1.10}
$$
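Identities (1.9) and (1.10) both rely on mixed partial derivatives commuting, which holds for sufficiently smooth fields. The sketch below checks them symbolically for arbitrary fields. It assumes SymPy is available, and the helper names (`grad`, `div`, `curl`) are hypothetical, introduced only for this illustration. SymPy treats the mixed partials of an undefined function as interchangeable, which is the same smoothness assumption made in the derivation.

```python
import sympy as sp

x, y, z = sp.symbols("x y z", real=True)
COORDS = (x, y, z)

def grad(s):
    """Gradient of a scalar field as a 3x1 column vector."""
    return sp.Matrix([sp.diff(s, v) for v in COORDS])

def div(f):
    """Divergence of a 3x1 vector field."""
    return sum(sp.diff(f[i], v) for i, v in enumerate(COORDS))

def curl(f):
    """Curl of a 3x1 vector field, expanded from the determinant form in (1.9)."""
    return sp.Matrix([
        sp.diff(f[2], y) - sp.diff(f[1], z),
        sp.diff(f[0], z) - sp.diff(f[2], x),
        sp.diff(f[1], x) - sp.diff(f[0], y),
    ])

# Arbitrary (undefined) smooth fields: one scalar, one vector.
psi = sp.Function("psi")(x, y, z)
f = sp.Matrix([sp.Function(name)(x, y, z) for name in ("f_x", "f_y", "f_z")])

curl_of_grad = curl(grad(psi)).applyfunc(sp.simplify)  # identity (1.9)
div_of_curl = sp.simplify(div(curl(f)))                # identity (1.10)

print(curl_of_grad)  # every component is zero
print(div_of_curl)   # zero
```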
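The definition of KL divergence builds on entropy. Taking a discrete random variable as an example, we first give the definition of entropy and then the definition of KL divergence.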
By convention, $p(x_i)\log p(x_i)=0$ when $p(x_i)=0$.
Derivation:
- For the discrete probability distributions $p(x)$ and $q(x)$ above, the cross-entropy is defined as:
$$
H(p,q)=\sum_x p(x)\log{\frac{1}{q(x)}}=-\sum_x p(x)\log q(x)
$$
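In information theory, cross-entropy can be interpreted as the amount of information needed to encode events drawn from the true distribution $p(x)$ when using a code built for the predicted distribution $q(x)$.
- KL divergence, or relative entropy, is obtained as follows: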
$$
\begin{aligned}
D_{KL}(p \| q) &=H(p, q)-H(p) \\
&=-\sum_{x} p(x) \log q(x)-\sum_{x}-p(x) \log p(x) \\
&=-\sum_{x} p(x)(\log q(x)-\log p(x)) \\
&=-\sum_{x} p(x) \log \frac{q(x)}{p(x)}
\end{aligned}
$$
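The derivation above can be checked numerically. The following is a minimal sketch using only the Python standard library. The function names and the two example distributions are hypothetical, natural logarithms are used, and it assumes $q(x)>0$ wherever $p(x)>0$.

```python
import math

def entropy(p):
    """H(p) = -sum p_i log p_i, with the convention 0 * log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum p_i log q_i; terms with p_i = 0 contribute nothing."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum p_i log(p_i / q_i), the last line of the derivation."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical example distributions over three outcomes.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

direct = kl_divergence(p, q)
via_entropies = cross_entropy(p, q) - entropy(p)

print(direct, via_entropies)  # both equal 0.25 * ln 2, up to rounding
print(kl_divergence(q, p))    # same value here only because q is a permutation of p
```

In general $D_{KL}(p\|q) \neq D_{KL}(q\|p)$, so KL divergence is not a symmetric distance.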
KL divergence can be used to measure the difference between two distributions, and it has the following mathematical property:
$$
\color{red}D_{KL}(p\|q)\ge 0\tag{2.3}
$$
This follows directly from the Gibbs inequality. We first state the $\color{blue}{\text{Gibbs inequality}}$:
If $\sum^n_{i=1}p_i=\sum^n_{i=1}q_i=1$ and $p_i,q_i\in(0,1]$, then:
$$
-\sum_{i=1}^n p_i\log p_i\le-\sum_{i=1}^n p_i\log q_i\tag{2.4}
$$
Equality holds if and only if $p_i=q_i$ for all $i$.
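For completeness, one common way to prove the Gibbs inequality is sketched here. It uses the elementary bound $\ln t \le t-1$ for $t>0$ (with equality only at $t=1$) and assumes natural logarithms; any other base only rescales both sides by a positive constant.

$$
\begin{aligned}
\sum_{i=1}^n p_i \log \frac{q_i}{p_i}
&\le \sum_{i=1}^n p_i\left(\frac{q_i}{p_i}-1\right) \\
&= \sum_{i=1}^n q_i-\sum_{i=1}^n p_i \\
&= 1-1 = 0
\end{aligned}
$$

Rearranging gives $-\sum_{i} p_i\log p_i\le-\sum_{i} p_i\log q_i$, which is (2.4). The left-hand side of the display is exactly $-D_{KL}(p\|q)$, so (2.3) follows, and equality requires $q_i/p_i=1$ for every $i$.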
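- The unit of information depends on the base of the logarithm used in the formula.
  - Base 2: the unit is the bit.
  - Base $e$: the unit is the nat.
- Further reading:
  - English: Kullback-Leibler Divergence Explained
  - Chinese translation of the English article: 解释Kullback-Leibler散度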
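To make the remark about units concrete, the sketch below computes the entropy of a fair coin in both bits and nats. The `entropy` helper and its `base` parameter are hypothetical names used only for this illustration.

```python
import math

def entropy(p, base=math.e):
    """Entropy of a discrete distribution; `base` selects the unit."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

fair_coin = [0.5, 0.5]

h_bits = entropy(fair_coin, base=2)  # base 2: result in bits
h_nats = entropy(fair_coin)          # base e: result in nats

print(h_bits)                 # 1 bit, up to rounding
print(h_nats)                 # ln 2 nats
print(h_nats / math.log(2))   # converting nats to bits: divide by ln 2
```

The same conversion factor applies to cross-entropy and KL divergence, since all three are sums of logarithms.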
Some matrix identities used in the derivation:
- $E_p(\cdot)$ denotes the expectation of $\cdot$ under the probability density $p(x)$. For a multivariate normal distribution, the expectation of a quadratic form is written in matrix notation as:

$$
\color{blue}E(x^TAx)=\operatorname{tr}(A\Sigma)+\mu^TA\mu\tag{2.14}
$$

- Properties of the matrix trace:

$$
\color{blue}\begin{array}{l}
\text{Linearity: } \operatorname{tr}(\alpha A+\beta B)=\alpha \operatorname{tr}(A)+\beta \operatorname{tr}(B)\\
\text{Invariance under transposition: } \operatorname{tr}(A)=\operatorname{tr}\left(A^{T}\right)\\
\text{Commutativity for a product of two square matrices: } \operatorname{tr}(A B)=\operatorname{tr}(B A)\\
\text{Cyclic invariance: } \operatorname{tr}(A B C)=\operatorname{tr}(B C A)=\operatorname{tr}(C A B)
\end{array}\tag{2.15}
$$

- For a column vector $\lambda$, the product $\lambda^TA\lambda$ is a scalar, and the trace of a scalar is the scalar itself, i.e. $\operatorname{tr}(\lambda^TA\lambda)=\lambda^TA\lambda$. Hence, by cyclic invariance:

$$
\color{blue}\lambda^TA\lambda=\operatorname{tr}(\lambda^TA\lambda)=\operatorname{tr}(A\lambda\lambda^T)\tag{2.16}
$$

- Properties of the mean $\mu$ and covariance $\Sigma$ of a multivariate Gaussian distribution:

$$
\color{blue}E[xx^T]=\Sigma+\mu\mu^T\tag{2.17}
$$

$$
\color{blue}E(x^TAx)=\operatorname{tr}(A\Sigma)+\mu^TA\mu\tag{2.18}
$$
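Identity (2.18) follows from (2.16) and (2.17) in a few steps. The chain below is a sketch of that reasoning, assuming $A$ is a constant matrix and $x$ has mean $\mu$ and covariance $\Sigma$:

$$
\begin{aligned}
E(x^TAx) &= E\left[\operatorname{tr}(Axx^T)\right] && \text{by (2.16)} \\
&= \operatorname{tr}\left(A\,E[xx^T]\right) && \text{trace and expectation are both linear} \\
&= \operatorname{tr}\left(A(\Sigma+\mu\mu^T)\right) && \text{by (2.17)} \\
&= \operatorname{tr}(A\Sigma)+\operatorname{tr}(A\mu\mu^T) && \text{linearity of the trace} \\
&= \operatorname{tr}(A\Sigma)+\mu^TA\mu && \text{by (2.16) again}
\end{aligned}
$$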
Therefore:
$$
\color{red}\begin{aligned}
D_{KL}(p \| q)=& \frac{1}{2} \log \frac{\left|\Sigma_{2}\right|}{\left|\Sigma_{1}\right|}+\frac{1}{2} E_{p(x)}\left[\left(x-\mu_{2}\right)^{T} \Sigma_{2}^{-1}\left(x-\mu_{2}\right)-\left(x-\mu_{1}\right)^{T} \Sigma_{1}^{-1}\left(x-\mu_{1}\right)\right] \\
=& \frac{1}{2} \log \frac{\left|\Sigma_{2}\right|}{\left|\Sigma_{1}\right|}+\frac{1}{2} \operatorname{tr}\left(\Sigma_{2}^{-1} \Sigma_{1}\right)+\frac{1}{2}\left(\mu_{1}-\mu_{2}\right)^{T} \Sigma_{2}^{-1}\left(\mu_{1}-\mu_{2}\right)-\frac{1}{2} N
\end{aligned}\tag{2.19}
$$
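The closed form can be evaluated directly. The following is a minimal sketch, assuming NumPy is available, with $p=\mathcal{N}(\mu_1,\Sigma_1)$, $q=\mathcal{N}(\mu_2,\Sigma_2)$, and $N$ the dimension of $x$. It uses the standard form of the result, in which the quadratic term $(\mu_1-\mu_2)^T\Sigma_2^{-1}(\mu_1-\mu_2)$ also carries a factor $\frac{1}{2}$. The function name `gaussian_kl` and the example values are hypothetical.

```python
import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1) || N(mu2, sigma2)), in nats."""
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    n = mu1.shape[0]  # dimension N of the random vector
    diff = mu1 - mu2

    # slogdet returns (sign, log|det|); steadier than log(det(...)).
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)

    # Solve linear systems rather than forming the inverse of sigma2.
    trace_term = np.trace(np.linalg.solve(sigma2, sigma1))
    quad_term = diff @ np.linalg.solve(sigma2, diff)

    return 0.5 * (logdet2 - logdet1 + trace_term + quad_term - n)

# One-dimensional example: unit variances, means one unit apart.
print(gaussian_kl([0.0], [[1.0]], [1.0], [[1.0]]))  # 0.5

# Identical distributions have zero divergence.
mu = [1.0, -2.0]
sigma = [[2.0, 0.3], [0.3, 1.0]]
print(gaussian_kl(mu, sigma, mu, sigma))  # 0, up to rounding
```

In the first example the log-determinant term is 0 and the trace term cancels against $N=1$, so only the quadratic term $\frac{1}{2}\cdot 1$ remains.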