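In probability theory and information theory, the KL divergence (Kullback-Leibler divergence), also called relative entropy, is a way to describe the difference between two probability distributions P and Q. The KL divergence is asymmetric: in general, D(P || Q) ≠ D(Q || P). In information theory, D(P || Q) is the information lost when the distribution Q is used to approximate the true distribution P, where P is the true distribution and Q is the distribution fitted to P.
Some people call the KL divergence the "KL distance", but it does not actually satisfy the definition of a distance, because:
1) the KL divergence is asymmetric;
2) the KL divergence does not satisfy the triangle inequality.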
For two probability distributions P and Q of a discrete or a continuous random variable, the KL divergence is defined as follows:
Discrete random variable
$$D(P||Q)=\sum\limits_{i\in X}P(i)\cdot\left[\log\left(\frac{P(i)}{Q(i)}\right)\right]$$
Continuous random variable
$$D(P||Q)=\int_{x}P(x)\cdot\left[\log\left(\frac{P(x)}{Q(x)}\right)\right]dx$$
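As a quick numerical illustration of the discrete definition, here is a minimal Python sketch. The function name `kl_divergence` and the two example distributions are made up for this illustration and are not part of the original post. It assumes the natural logarithm and the usual convention that terms with P(i) = 0 contribute nothing.

```python
import math


def kl_divergence(p, q):
    """D(P||Q) = sum_i P(i) * log(P(i) / Q(i)), in nats (natural log)."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # convention: 0 * log(0 / q) contributes 0
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


# Hypothetical example distributions over three outcomes.
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

d_pq = kl_divergence(P, Q)
d_qp = kl_divergence(Q, P)
print(f"D(P||Q) = {d_pq:.6f}")
print(f"D(Q||P) = {d_qp:.6f}")
```

The two printed values are both positive but differ from each other, which is the asymmetry that keeps the KL divergence from being a true distance.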
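The physical meaning of the KL divergence in information theory:
Information content
Shannon, the founder of information theory, held that "information is that which eliminates random uncertainty". In other words, the amount of information a message carries is measured by how much uncertainty it removes.
The information content of an event decreases as its probability increases: the more probable the event, the less information it carries, and the less probable the event, the more information it carries.
If an event occurs with probability P(x), its information content is: $$\mathrm{I}(x)=-\log(P(x))=\log\left(\frac{1}{P(x)}\right)$$
where $\mathrm{I}(x)$ denotes the information content and $\log$ is the natural logarithm (base e), so the result is measured in nats. With a base-2 logarithm the unit is the bit.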
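To make the relation between probability and information content concrete, the following small sketch evaluates I(x) for a few probabilities. The helper name `information_content` and its `base` parameter are assumptions introduced here for illustration only.

```python
import math


def information_content(prob, base=math.e):
    """I(x) = -log(P(x)); nats for base e, bits for base 2."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log(prob) / math.log(base)


# Rarer events carry more information.
for prob in (0.5, 0.1, 0.01):
    nats = information_content(prob)
    bits = information_content(prob, base=2)
    print(f"P(x) = {prob}: {nats:.4f} nats, {bits:.4f} bits")
```

An event with probability 0.5 carries exactly 1 bit, and the value grows as the probability shrinks.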
The KL divergence has a clear physical meaning in information theory: it measures the average number of extra bits needed to encode samples from the distribution P with a code based on the distribution Q (the count is in bits when the logarithm is taken to base 2). In machine learning, its physical meaning is a measure of how similar or close two functions are, and it is also frequently used in functional analysis.
In the formulas below, the ${\color{green}\text{green}}$ and ${\color{red}\text{red}}$ parts are the information content.
In Shannon's information theory, when samples from P are encoded with a code based on P, the average number of bits required by the optimal code (that is, the entropy of this character set) is:
$$H(x)=\sum_{x\in X}{\color{blue}\underbrace{P(x)}_{\text{frequency of each character in }P}}\cdot{\color{green}\underbrace{\log\left(\frac{1}{P(x)}\right)}_{\text{code length of this character under }P}}$$
When samples from P are instead encoded with a code based on Q, the number of bits required becomes:
$$H^{\prime}(x)=\sum_{x\in X}{\color{blue}\underbrace{P(x)}_{\text{frequency of each character in }P}}\cdot{\color{red}\underbrace{\log\left(\frac{1}{Q(x)}\right)}_{\text{code length of this character under }Q\text{, independent of }P}}$$
From these, the KL divergence of P and Q follows:
$$\begin{aligned} D(P||Q)=&H^{\prime}(x)-H(x)=\sum_{x\in X}P(x)\cdot\log\left(\frac{1}{Q(x)}\right)-\sum_{x\in X}P(x)\cdot\log\left(\frac{1}{P(x)}\right)\\ =&\sum_{x\in X}P(x)\cdot\log\left(\frac{P(x)}{Q(x)}\right) \end{aligned}$$
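The coding interpretation can be checked numerically. The sketch below computes the entropy H, the cross-entropy H′ and their difference in bits (base-2 logarithm). The function names and the two distributions are hypothetical, chosen so that every code length is a whole number of bits, and the code assumes Q(x) > 0 wherever P(x) > 0.

```python
import math


def entropy(p):
    """H = sum_x P(x) * log2(1 / P(x)): average bits with a code built for P."""
    return sum(p_x * math.log2(1.0 / p_x) for p_x in p if p_x > 0)


def cross_entropy(p, q):
    """H' = sum_x P(x) * log2(1 / Q(x)): samples from P, code built for Q."""
    return sum(p_x * math.log2(1.0 / q_x) for p_x, q_x in zip(p, q) if p_x > 0)


# Hypothetical distributions over a three-character alphabet.
P = [0.5, 0.25, 0.25]
Q = [0.25, 0.25, 0.5]

h = entropy(P)
h_prime = cross_entropy(P, Q)
kl_bits = h_prime - h
kl_direct = sum(p_x * math.log2(p_x / q_x) for p_x, q_x in zip(P, Q))

print(h, h_prime, kl_bits, kl_direct)  # prints: 1.5 1.75 0.25 0.25
```

Using the code designed for Q costs 0.25 extra bits per character on average, and the same number comes out of the direct KL formula.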
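See also my other blog post: KL散度详解 (KL divergence explained in detail)
Reference: KL散度的含义与性质 (The meaning and properties of the KL divergence)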
Consider two Gaussian distributions of a single continuous variable, $P(x)\sim \mathcal N(\mu_{1},\sigma_{1}^{2})$ and $Q(x)\sim \mathcal N(\mu_{2},\sigma_{2}^{2})$.
From the definition of the KL divergence for continuous random variables:
$$\begin{aligned} KL(P||Q)=&KL(\mathcal N(\mu_{1},\sigma_{1}^{2})||\mathcal N(\mu_{2},\sigma_{2}^{2}))\\ =&\int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}\log\left(\frac{\frac{1}{\sigma_{1}\sqrt{2\pi}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}}{\frac{1}{\sigma_{2}\sqrt{2\pi}}e^{-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}}}\right)dx\\ =&\int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}\left[\log\frac{\sigma_{2}}{\sigma_{1}}-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}+\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\right]dx \end{aligned}$$
Split the expression above into three terms and evaluate each in turn:
First term:
$$\log\frac{\sigma_{2}}{\sigma_{1}}\int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx=\log\frac{\sigma_{2}}{\sigma_{1}}$$
For the second term, note that the integral is the variance of P (scaled by $\frac{1}{2\sigma_{1}^{2}}$). Evaluated explicitly:
$$\begin{aligned} -\frac{1}{\sigma_{1}\sqrt{2\pi}}\int_{x}\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx =&-\frac{1}{\sigma_{1}\sqrt{2\pi}}\int_{x}\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)^{2}e^{-\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)^{2}}dx\\ =&-\frac{1}{\sqrt{\pi}}\int_{x}\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)^{2}e^{-\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)^{2}}d\left(\frac{x-\mu_{1}}{\sigma_{1}\sqrt{2}}\right)\\ =&-\frac{1}{\sqrt{\pi}}\int_{t}t^{2}\cdot e^{-t^{2}}dt\\ =&-\frac{1}{\sqrt{\pi}}\cdot\frac{\sqrt{\pi}}{2}\\ =&-\frac{1}{2} \end{aligned}$$
--------------- The derivation is as follows: ------------------
$$\int_{-\infty}^{+\infty}x^{2}\cdot e^{-x^{2}}dx=2\int_{0}^{+\infty}x^{2}\cdot e^{-x^{2}}dx\xlongequal{t=x^{2}}\int_{0}^{+\infty}\sqrt{t}\cdot e^{-t}dt$$
The $\Gamma$ function is defined as:
$$\Gamma(s) = \int_{0}^{+\infty}x^{s-1}\cdot e^{-x}dx$$
The $\Gamma$ function has the following properties:
$$\Gamma(s+1) = s\Gamma(s)$$
$$\Gamma(1)=1\quad\quad\Gamma\left(\frac{1}{2}\right)=\sqrt{\pi}\quad\quad\Gamma(n+1)=n!$$
$$\Gamma\left(\frac{3}{2}\right)=\int_{0}^{+\infty}\sqrt{x}\cdot e^{-x}dx=\Gamma\left(\frac{1}{2}+1\right)=\frac{1}{2}\cdot\Gamma\left(\frac{1}{2}\right)=\frac{\sqrt{\pi}}{2}$$
Therefore:
$$\int_{-\infty}^{+\infty}x^{2}\cdot e^{-x^{2}}dx=\frac{\sqrt{\pi}}{2}$$
---------------------------------
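The value of this integral is easy to sanity-check numerically. The sketch below compares a plain midpoint-rule approximation with `math.gamma(1.5)` and with the closed form. The helper `midpoint_integral`, the integration bounds of ±10 and the number of subintervals are arbitrary choices made for this illustration.

```python
import math


def integrand(x):
    """x^2 * exp(-x^2), the function integrated in the derivation."""
    return x * x * math.exp(-x * x)


def midpoint_integral(f, a, b, n):
    """Approximate the integral of f over [a, b] with n midpoint rectangles."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))


# The integrand is negligible beyond |x| = 10, so a finite range is enough.
numeric = midpoint_integral(integrand, -10.0, 10.0, 20000)
via_gamma = math.gamma(1.5)           # Gamma(3/2)
closed_form = math.sqrt(math.pi) / 2  # sqrt(pi) / 2

print(numeric, via_gamma, closed_form)
```

All three numbers agree to many decimal places, which supports the $\frac{\sqrt{\pi}}{2}$ used in the second term.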
In the third term, the pieces inside the integral are the second moment, the mean and a constant, so:
$$\begin{aligned} \int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}\cdot\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\cdot e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx=&\frac{1}{2\sigma_{1}\sigma_{2}^{2}\sqrt{2\pi}}\int_{x}\left(x^{2} -2x\mu_{2}+ \mu_{2}^{2}\right)\cdot e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx\\ =&\frac{\sigma_{1}^{2}+\mu_{1}^{2}-2\mu_{1}\mu_{2}+\mu_{2}^{2}}{2\sigma_{2}^{2}}\\ =&\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}} \end{aligned}$$
------------- Calculation: ----------------
Expanding around $\mu_{1}$ instead, the first term is the variance, the second term is an odd function about $\mu_{1}$ whose integral over the whole real line is 0, and the third term is a constant that can be pulled out as a coefficient:
$$\begin{aligned} \int_{x}\frac{1}{\sigma_{1}\sqrt{2\pi}}\cdot\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\cdot e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx=&\frac{1}{2\sigma_{2}^{2}}\int_{x}\left[(x-\mu_{1})^{2}+2(\mu_{1}-\mu_{2})(x-\mu_{1})+(\mu_{1}-\mu_{2})^{2}\right]\cdot\frac{1}{\sigma_{1}\sqrt{2\pi}}\cdot e^{-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}}dx\\ =&\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}} \end{aligned}$$
-----------------------------
Collecting the three terms gives the final result:
$$\begin{aligned} KL(P||Q)=&KL(\mathcal N(\mu_{1},\sigma_{1}^{2})||\mathcal N(\mu_{2},\sigma_{2}^{2}))\\ =&\log\frac{\sigma_{2}}{\sigma_{1}}+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2} \end{aligned}$$
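One way to gain confidence in this closed form is to compare it with a direct numerical integration of the definition. The sketch below does that with a midpoint rule. The function names, the parameter values and the integration range of ±12 standard deviations around $\mu_{1}$ are assumptions made for this illustration.

```python
import math


def kl_gauss_1d(mu1, sigma1, mu2, sigma2):
    """Closed form: log(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)


def log_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2); logs avoid 0/0 in the far tails."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)


def kl_numeric(mu1, sigma1, mu2, sigma2, n=20000, width=12.0):
    """Midpoint-rule estimate of the integral of P(x) * log(P(x) / Q(x))."""
    a = mu1 - width * sigma1
    b = mu1 + width * sigma1
    h = (b - a) / n
    total = 0.0
    for k in range(n):
        x = a + (k + 0.5) * h
        lp = log_pdf(x, mu1, sigma1)
        total += math.exp(lp) * (lp - log_pdf(x, mu2, sigma2))
    return total * h


# Hypothetical parameters: P = N(0, 1), Q = N(1, 2^2).
closed = kl_gauss_1d(0.0, 1.0, 1.0, 2.0)
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)
print(f"closed form = {closed:.6f}, numeric = {numeric:.6f}")
```

The two values should match closely, and `kl_gauss_1d` returns 0 when both distributions have the same mean and standard deviation.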
For the multivariate case, let $\mathrm{x}\in \mathbb{R}^{d}$, with Gaussian density
$$\mathcal{N}(\mathrm{x} \mid \mu, \Sigma)=\frac{1}{(2 \pi)^{\frac{d}{2}}|\Sigma|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}(\mathrm{x}-\mu)^{\mathrm{T}} \Sigma^{-1}(\mathrm{x}-\mu)}$$
$$P(\mathrm{x})\sim\mathcal{N}(\mathrm{x}|\mu_{1},\Sigma_{1})\quad\quad \quad Q(\mathrm{x})\sim\mathcal{N}(\mathrm{x}|\mu_{2},\Sigma_{2})$$
$$\begin{aligned} &\mathrm{KL}\left(\mathcal{N}\left(\mathrm{x} \mid \mu_{1}, \Sigma_{1}\right)|| \mathcal{N}\left(\mathrm{x} \mid \mu_{2}, \Sigma_{2}\right)\right)\\ =&\int_{\mathrm{x}_{1}} \cdots \int_{\mathrm{x}_{d}} \frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)} \log \frac{\frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)}}{\frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{2}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{2}\right)^{\mathrm{T}} \Sigma_{2}^{-1}\left(\mathrm{x}-\mu_{2}\right)}} \mathrm{dx}_{1} \cdots \mathrm{dx}_{d}\\ =&\int_{\mathrm{x}_{1}} \cdots \int_{\mathrm{x}_{d}} \frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)}\left[\frac{1}{2} \log \frac{\left|\Sigma_{2}\right|}{\left|\Sigma_{1}\right|}-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)+\frac{1}{2}\left(\mathrm{x}-\mu_{2}\right)^{\mathrm{T}} \Sigma_{2}^{-1}\left(\mathrm{x}-\mu_{2}\right)\right] \mathrm{dx}_{1}\cdots\mathrm{dx}_{d} \end{aligned}$$
Again, compute the three terms separately:
First term:
$$\frac{1}{2} \log \frac{\left|\Sigma_{2}\right|}{\left|\Sigma_{1}\right|} \int_{\mathrm{x}_{1}} \cdots \int_{\mathrm{x}_{d}} \frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)} \mathrm{dx}_{1} \cdots \mathrm{dx}_{d}=\frac{1}{2} \log \frac{\left|\Sigma_{2}\right|}{\left|\Sigma_{1}\right|}$$
Second term:
$$-\frac{1}{2} \int_{\mathrm{x}_{1}} \cdots \int_{\mathrm{x}_{d}} \frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right) \mathrm{dx}_{1} \cdots \mathrm{dx}_{d}$$
$\Sigma_{1}$ is a symmetric positive definite matrix (it must be invertible for the density above to exist). Let $\Sigma_{1}^{-1}=\mathrm{U}^{\mathrm{T}}\mathrm{U}$ and $\mathrm{y}=\mathrm{U}(\mathrm{x}-\mu_{1})$. Since the matrix of a linear transformation is its Jacobian matrix:
$$\mathrm{dy}_{1} \cdots \mathrm{dy}_{d}=|\mathrm{U}|\, \mathrm{dx}_{1} \cdots \mathrm{dx}_{d}$$
From $|\Sigma_{1}^{-1}|=|\mathrm{U}^{\mathrm{T}}\mathrm{U}|=|\mathrm{U}|^{2}$ it follows that $|\Sigma_{1}^{-\frac{1}{2}}|=|\Sigma_{1}|^{-\frac{1}{2}}=|\mathrm{U}|$ (taking $|\mathrm{U}|>0$). Because $\mathrm{y}$ is then a standard normal vector, the expectation of $\mathrm{y}^{\mathrm{T}}\mathrm{y}$ equals $d$, so:
$$\begin{aligned} &-\frac{1}{2|\Sigma_{1}|^{\frac{1}{2}}}\int_{\mathrm{y}_{1}}\cdots\int_{\mathrm{y}_{d}}\frac{1}{(2\pi)^{\frac{d}{2}}}\mathrm{e}^{-\frac{1}{2}\mathrm{y}^{\mathrm{T}}\mathrm{y}}\,\mathrm{y}^{\mathrm{T}}\mathrm{y}\,|\mathrm{U}|^{-1}\,\mathrm{dy}_{1}\cdots \mathrm{dy}_{d}\\ =&-\frac{1}{2\left|\Sigma_{1}\right|^{\frac{1}{2}}}\left|\Sigma_{1}\right|^{\frac{1}{2}} \cdot d=-\frac{d}{2} \end{aligned}$$
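The key fact used in the second term is that, after the change of variables, y is a standard normal vector and the expectation of yᵀy equals the dimension d. A minimal Monte Carlo sketch of that fact follows. The function name, the sample count, the seed and the choice d = 3 are all arbitrary assumptions for illustration.

```python
import random


def mean_squared_norm(d, n_samples, seed=0):
    """Monte Carlo estimate of E[y^T y] for y ~ N(0, I_d)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # y^T y is the sum of d squared independent standard normal draws.
        total += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
    return total / n_samples


d = 3
estimate = mean_squared_norm(d, 100_000)
second_term = -0.5 * estimate  # should be close to -d / 2

print(f"E[y^T y] ~ {estimate:.3f} (d = {d}), second term ~ {second_term:.3f}")
```

The estimate fluctuates with the seed, but it stays close to d, so the second term stays close to −d/2.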
Third term:
A small trick is needed here:
$$\mathrm{x}^{\mathrm{T}} A \mathrm{x}=\operatorname{tr}\left(A \mathrm{x}\mathrm{x}^{\mathrm{T}}\right)$$
$$\begin{array}{l} \frac{1}{2} \int_{\mathrm{x}_{1}} \cdots \int_{\mathrm{x}_{d}} \frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)}\left(\mathrm{x}-\mu_{2}\right)^{\mathrm{T}} \Sigma_{2}^{-1}\left(\mathrm{x}-\mu_{2}\right) \mathrm{dx}_{1} \cdots \mathrm{dx}_{d} \\ =\frac{1}{2} \int_{\mathrm{x}_{1}} \cdots \int_{\mathrm{x}_{d}} \frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)} \operatorname{tr}\left[\Sigma_{2}^{-1}\left(\mathrm{x}-\mu_{2}\right)\left(\mathrm{x}-\mu_{2}\right)^{\mathrm{T}}\right] \mathrm{dx}_{1} \cdots \mathrm{dx}_{d} \\ =\frac{1}{2} \operatorname{tr}\left[\Sigma_{2}^{-1} \int_{\mathrm{x}_{1}} \cdots \int_{\mathrm{x}_{d}} \frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)}\left(\mathrm{x}-\mu_{2}\right)\left(\mathrm{x}-\mu_{2}\right)^{\mathrm{T}} \mathrm{dx}_{1} \cdots \mathrm{dx}_{d}\right] \\ =\frac{1}{2} \operatorname{tr}\left[\Sigma_{2}^{-1} \int_{\mathrm{x}_{1}} \cdots \int_{\mathrm{x}_{d}} \frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)}\left(\mathrm{xx}^{\mathrm{T}}-\mu_{2} \mathrm{x}^{\mathrm{T}}-\mathrm{x} \mu_{2}^{\mathrm{T}}+\mu_{2} \mu_{2}^{\mathrm{T}}\right) \mathrm{dx}_{1} \cdots \mathrm{dx}_{d}\right] \end{array}$$
After integration, the first term is the second moment $\Sigma_{1}+\mu_{1}\mu_{1}^{\mathrm{T}}$, the second and third terms involve the mean $\mu_{1}$, and the fourth term is a constant:
$$\begin{array}{l} \frac{1}{2} \operatorname{tr}\left[\Sigma_{2}^{-1} \int_{\mathrm{x}_{1}} \cdots \int_{\mathrm{x}_{d}} \frac{1}{(2 \pi)^{\frac{d}{2}}\left|\Sigma_{1}\right|^{\frac{1}{2}}} \mathrm{e}^{-\frac{1}{2}\left(\mathrm{x}-\mu_{1}\right)^{\mathrm{T}} \Sigma_{1}^{-1}\left(\mathrm{x}-\mu_{1}\right)}\left(\mathrm{xx}^{\mathrm{T}}-\mu_{2} \mathrm{x}^{\mathrm{T}}-\mathrm{x} \mu_{2}^{\mathrm{T}}+\mu_{2} \mu_{2}^{\mathrm{T}}\right) \mathrm{dx}_{1} \cdots \mathrm{dx}_{d}\right] \\ =\frac{1}{2} \operatorname{tr}\left[\Sigma_{2}^{-1}\left(\Sigma_{1}+\mu_{1} \mu_{1}^{\mathrm{T}}-\mu_{2} \mu_{1}^{\mathrm{T}}-\mu_{1} \mu_{2}^{\mathrm{T}}+\mu_{2} \mu_{2}^{\mathrm{T}}\right)\right] \\ =\frac{1}{2}\left[\operatorname{tr}\left(\Sigma_{2}^{-1} \Sigma_{1}\right)+\operatorname{tr}\left(\Sigma_{2}^{-1}\left(\mu_{1}-\mu_{2}\right)\left(\mu_{1}-\mu_{2}\right)^{\mathrm{T}}\right)\right] \\ =\frac{1}{2}\left[\operatorname{tr}\left(\Sigma_{2}^{-1} \Sigma_{1}\right)+\left(\mu_{1}-\mu_{2}\right)^{\mathrm{T}} \Sigma_{2}^{-1}\left(\mu_{1}-\mu_{2}\right)\right] \end{array}$$
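The trace identity used for the third term, xᵀAx = tr(A x xᵀ), can be verified on a small example. The sketch below uses plain Python lists. The matrix `A`, the vector `x` and both helper names are hypothetical values chosen for illustration.

```python
def quadratic_form(x, A):
    """x^T A x computed as a double sum."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def trace_of_A_xxT(x, A):
    """tr(A x x^T): build the outer product, then sum the diagonal of A @ outer."""
    n = len(x)
    outer = [[x[i] * x[j] for j in range(n)] for i in range(n)]
    # Diagonal entry i of A @ outer is sum_k A[i][k] * outer[k][i].
    return sum(A[i][k] * outer[k][i] for i in range(n) for k in range(n))


# Hypothetical symmetric matrix and vector.
A = [[2.0, 0.5, 0.0],
     [0.5, 1.0, -1.0],
     [0.0, -1.0, 3.0]]
x = [1.0, -2.0, 0.5]

lhs = quadratic_form(x, A)
rhs = trace_of_A_xxT(x, A)
print(lhs, rhs)  # prints: 6.75 6.75
```

Moving the quadratic form inside a trace is what allows the integral to be taken over the matrix $(\mathrm{x}-\mu_{2})(\mathrm{x}-\mu_{2})^{\mathrm{T}}$ alone.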
Collecting the three terms, the KL divergence of two $d$-dimensional Gaussian distributions is:
$$\mathrm{KL}\left(\mathcal{N}\left(\mathrm{x} \mid \mu_{1}, \Sigma_{1}\right)|| \mathcal{N}\left(\mathrm{x} \mid \mu_{2}, \Sigma_{2}\right)\right)=\frac{1}{2}\left[\log \frac{\left|\Sigma_{2}\right|}{\left|\Sigma_{1}\right|}-d+\operatorname{tr}\left(\Sigma_{2}^{-1} \Sigma_{1}\right)+\left(\mu_{1}-\mu_{2}\right)^{\mathrm{T}} \Sigma_{2}^{-1}\left(\mu_{1}-\mu_{2}\right)\right]$$
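The closed form translates into code fairly directly. The sketch below is one possible standard-library-only implementation, not the only way to do it: it assumes both covariance matrices are symmetric positive definite and uses a hand-written Cholesky factorization to obtain the log-determinants, the trace term and the quadratic term. All function names and example values are hypothetical.

```python
import math


def cholesky(S):
    """Lower-triangular L with S = L L^T, for symmetric positive definite S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return z


def kl_gauss(mu1, S1, mu2, S2):
    """KL(N(mu1, S1) || N(mu2, S2)) from the closed form, in nats."""
    d = len(mu1)
    L1, L2 = cholesky(S1), cholesky(S2)
    # log|S2| - log|S1|, using log|S| = 2 * sum_i log(L_ii).
    log_det_ratio = 2.0 * sum(math.log(L2[i][i]) - math.log(L1[i][i])
                              for i in range(d))
    # tr(S2^{-1} S1) equals the squared Frobenius norm of L2^{-1} L1.
    trace_term = 0.0
    for j in range(d):
        column = forward_solve(L2, [L1[i][j] for i in range(d)])
        trace_term += sum(v * v for v in column)
    # (mu1 - mu2)^T S2^{-1} (mu1 - mu2) equals ||L2^{-1} (mu1 - mu2)||^2.
    z = forward_solve(L2, [a - b for a, b in zip(mu1, mu2)])
    quad_term = sum(v * v for v in z)
    return 0.5 * (log_det_ratio - d + trace_term + quad_term)


# Hypothetical 2-D example.
mu_a, cov_a = [0.0, 0.0], [[1.0, 0.2], [0.2, 1.0]]
mu_b, cov_b = [1.0, -1.0], [[2.0, 0.3], [0.3, 1.5]]
kl_ab = kl_gauss(mu_a, cov_a, mu_b, cov_b)
print(f"KL = {kl_ab:.6f}")
```

For d = 1 this reduces to the univariate formula derived earlier, which makes a convenient consistency check. In practice a linear algebra library would normally replace the hand-written Cholesky routine.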
Reference blog post
Diffusion Model中的高斯分布的KL散度 (KL divergence of Gaussian distributions in diffusion models)