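The KL divergence between two Gaussian distributions is straightforward to derive once you pick the right approach.
1. Univariate Gaussian distributions
Applying the definition of the KL divergence to two Gaussians gives: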
$$
\begin{aligned}
KL(\mathcal{N}(\mu_1, \sigma_1^2) || \mathcal{N}(\mu_2, \sigma_2^2)) &= \int_x \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} \log \frac{\frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}}}{\frac{1}{\sqrt{2\pi}\sigma_2} e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}} dx \\
&= \int_x \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} \Bigg[ \log \frac{\sigma_2}{\sigma_1} - \frac{(x-\mu_1)^2}{2\sigma_1^2} + \frac{(x-\mu_2)^2}{2\sigma_2^2} \Bigg] dx
\end{aligned}
$$
The first term is easy, because a density integrates to 1 over its whole domain:
$$
\log \frac{\sigma_2}{\sigma_1} \int_x \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} dx = \log \frac{\sigma_2}{\sigma_1}
$$
For the second term, recognize that the integral is the variance of the first distribution:
$$
-\frac{1}{2\sigma_1^2} \int_x (x-\mu_1)^2 \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} dx = -\frac{1}{2\sigma_1^2} \sigma_1^2 = -\frac{1}{2}
$$
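Expanding the square in the third term leaves integrals for the second moment ($\sigma_1^2 + \mu_1^2$), the mean ($\mu_1$), and a constant, so: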
$$
\begin{aligned}
\frac{1}{2\sigma_2^2} \int_x (x-\mu_2)^2 \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} dx &= \frac{1}{2\sigma_2^2} \int_x ( x^2 - 2\mu_2 x + \mu_2^2 ) \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} dx \\
&= \frac{\sigma_1^2 + \mu_1^2 - 2 \mu_1 \mu_2 + \mu_2^2}{2\sigma_2^2} = \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2}
\end{aligned}
$$
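Alternatively, a small trick simplifies the same integral: expand the square around $\mu_1$. The first term then integrates to the variance, the second is an odd function about $\mu_1$ and integrates to 0, and the third is a constant that factors out of the integral: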
$$
\begin{aligned}
\frac{1}{2\sigma_2^2} \int_x (x-\mu_2)^2 \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} dx &= \frac{1}{2\sigma_2^2} \int_x \big[ (x-\mu_1)^2 + 2(\mu_1 - \mu_2)(x - \mu_1) + (\mu_1 - \mu_2)^2 \big] \frac{1}{\sqrt{2\pi}\sigma_1} e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} dx \\
&= \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2}
\end{aligned}
$$
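The third-term identity can be sanity-checked numerically. The following is only a minimal sketch, not part of the original derivation: the helper names `gauss_pdf` and `expected_sq_dev`, the parameter values, and the midpoint-rule settings (`n`, `width`) are all assumptions chosen for illustration.

```python
import math


def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


def expected_sq_dev(mu1, sigma1, mu2, n=20_000, width=12.0):
    """Midpoint-rule estimate of E[(x - mu2)^2] for x ~ N(mu1, sigma1^2).

    The integral is truncated to mu1 +/- width * sigma1 (an assumption;
    the tails beyond that range are negligible for a Gaussian).
    """
    lo = mu1 - width * sigma1
    h = 2.0 * width * sigma1 / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += (x - mu2) ** 2 * gauss_pdf(x, mu1, sigma1)
    return total * h


# Illustrative values only.
mu1, sigma1, mu2 = 0.5, 1.5, -1.0
numeric = expected_sq_dev(mu1, sigma1, mu2)
closed = sigma1 ** 2 + (mu1 - mu2) ** 2  # sigma_1^2 + (mu_1 - mu_2)^2
print(numeric, closed)
```

The two printed numbers should agree closely; dividing either by $2\sigma_2^2$ gives the third term.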
Collecting the three terms, the KL divergence between two univariate Gaussians is:
$$
KL(\mathcal{N}(\mu_1, \sigma_1^2) || \mathcal{N}(\mu_2, \sigma_2^2)) = \log \frac{\sigma_2}{\sigma_1} - \frac{1}{2} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2}
$$
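As a rough illustration, the closed form can be written as a short function and compared against direct numerical integration of the definition. The names `kl_gauss_1d` and `kl_numeric_1d`, and the integration settings, are hypothetical choices for this sketch; the result is in nats because the natural logarithm is used.

```python
import math


def kl_gauss_1d(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    return (math.log(sigma2 / sigma1) - 0.5
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2))


def kl_numeric_1d(mu1, sigma1, mu2, sigma2, n=20_000, width=12.0):
    """Midpoint-rule integral of p(x) * log(p(x) / q(x)) as a cross-check."""
    def log_pdf(x, mu, sigma):
        # Work with log densities to avoid 0/0 far out in the tails.
        return (-0.5 * ((x - mu) / sigma) ** 2
                - math.log(math.sqrt(2.0 * math.pi) * sigma))

    lo = mu1 - width * sigma1  # truncate to mu1 +/- width * sigma1
    h = 2.0 * width * sigma1 / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        lp = log_pdf(x, mu1, sigma1)
        lq = log_pdf(x, mu2, sigma2)
        total += math.exp(lp) * (lp - lq)
    return total * h


# Illustrative values only.
print(kl_gauss_1d(0.5, 1.5, -1.0, 2.0), kl_numeric_1d(0.5, 1.5, -1.0, 2.0))
```

Swapping the two distributions generally changes the value, which reflects the asymmetry of the KL divergence.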
2. Multivariate Gaussian distributions
The density of a $K$-dimensional Gaussian is:
$$
\mathcal{N}(x | \mu, \Sigma) = \frac{1}{(2\pi)^\frac{K}{2} |\Sigma|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}
$$
Substituting this density into the definition of the KL divergence:
$$
\begin{aligned}
KL(\mathcal{N}(x | \mu_1, \Sigma_1) || \mathcal{N}(x | \mu_2, \Sigma_2)) &= \int_{x_1} \cdots \int_{x_K} \frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)} \log \frac{\frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)}}{\frac{1}{(2\pi)^\frac{K}{2} |\Sigma_2|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)}} dx_1 \cdots dx_K \\
&= \int_{x_1} \cdots \int_{x_K} \frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)} \Bigg[ \frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|} - \frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \frac{1}{2}(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \Bigg] dx_1 \cdots dx_K
\end{aligned}
$$
As before, compute the three terms separately. The first term:
$$
\frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|} \int_{x_1} \cdots \int_{x_K} \frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)} dx_1 \cdots dx_K = \frac{1}{2} \log \frac{|\Sigma_2|}{|\Sigma_1|}
$$
The second term:
$$
-\frac{1}{2} \int_{x_1} \cdots \int_{x_K} \frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) dx_1 \cdots dx_K
$$
$\Sigma_1$ is a symmetric positive-definite matrix (it has to be invertible for the density to exist), so we can write $\Sigma_1^{-1} = U^T U$ and substitute $y = U(x - \mu_1)$, which turns the quadratic form $(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)$ into $y^T y$. For a linear transformation the Jacobian matrix is the transformation matrix itself, so
$$
dy_1 \cdots dy_K = |U| \, dx_1 \cdots dx_K
$$
From $|\Sigma_1^{-1}| = |U|^2$ it follows that $|\Sigma_1^{-\frac{1}{2}}| = |\Sigma_1|^{-\frac{1}{2}} = |U|$ (taking $U$ with positive determinant), and the second term becomes
$$
\begin{aligned}
&-\frac{1}{2 |\Sigma_1|^{\frac{1}{2}}} \int_{y_1} \cdots \int_{y_K} \frac{1}{(2\pi)^\frac{K}{2}} e^{-\frac{1}{2} y^T y} \, y^T y \, |U|^{-1} dy_1 \cdots dy_K \\
&= -\frac{1}{2 |\Sigma_1|^{\frac{1}{2}}} |\Sigma_1|^{\frac{1}{2}} \cdot K = -\frac{K}{2}
\end{aligned}
$$
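The factor $K$ in the last step may deserve one more line. As a sketch: after the substitution, the remaining integrand is the density of a standard normal $\mathcal{N}(0, I)$ times $y^T y$, so the integral is the expected squared norm of $y$, and each of the $K$ independent components has unit variance. The expectation symbol $\mathbb{E}$ and the identity matrix $I$ are notation introduced here, not taken from the derivation above.

```latex
\begin{aligned}
\int_{y_1} \cdots \int_{y_K} \frac{1}{(2\pi)^{\frac{K}{2}}} e^{-\frac{1}{2} y^T y} \, y^T y \, dy_1 \cdots dy_K
&= \mathbb{E}_{y \sim \mathcal{N}(0, I)} \Big[ \sum_{i=1}^{K} y_i^2 \Big] \\
&= \sum_{i=1}^{K} \mathbb{E}[y_i^2] = K
\end{aligned}
```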
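The third term relies on a small trick. A quadratic form is a scalar, so it equals its own trace, and the trace is invariant under cyclic permutation: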
$$
x^T A x = \mathrm{tr}(A x x^T)
$$
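The trace identity is easy to confirm on a concrete example. This is a minimal pure-Python sketch; the helper names `quad_form` and `trace_form` and the sample vector and matrix are made up for illustration.

```python
def quad_form(x, A):
    """Compute x^T A x for a vector x and a square matrix A (nested lists)."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


def trace_form(x, A):
    """Compute tr(A x x^T) by building the outer product x x^T explicitly."""
    n = len(x)
    outer = [[x[i] * x[j] for j in range(n)] for i in range(n)]  # x x^T
    # Diagonal entry i of (A @ outer) is sum_k A[i][k] * outer[k][i].
    return sum(A[i][k] * outer[k][i] for i in range(n) for k in range(n))


# Illustrative values only.
x = [1.0, -2.0, 0.5]
A = [[2.0, 0.3, 0.0],
     [0.3, 1.0, -0.4],
     [0.0, -0.4, 3.0]]
print(quad_form(x, A), trace_form(x, A))
```

The identity does not require `A` to be symmetric; the symmetric matrix here simply mirrors the role of $\Sigma_2^{-1}$ in the derivation.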
Applying this identity to the third term, and moving the trace and $\Sigma_2^{-1}$ outside the integral by linearity:
$$
\begin{aligned}
&\frac{1}{2} \int_{x_1} \cdots \int_{x_K} \frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)} (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) dx_1 \cdots dx_K \\
&= \frac{1}{2} \int_{x_1} \cdots \int_{x_K} \frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)} \mathrm{tr}[ \Sigma_2^{-1} (x - \mu_2) (x - \mu_2)^T ] dx_1 \cdots dx_K \\
&= \frac{1}{2} \mathrm{tr} \Bigg[ \Sigma_2^{-1} \int_{x_1} \cdots \int_{x_K} \frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)} (x - \mu_2) (x - \mu_2)^T dx_1 \cdots dx_K \Bigg] \\
&= \frac{1}{2} \mathrm{tr} \Bigg[ \Sigma_2^{-1} \int_{x_1} \cdots \int_{x_K} \frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)} (x x^T - \mu_2 x^T - x \mu_2^T + \mu_2 \mu_2^T ) dx_1 \cdots dx_K \Bigg]
\end{aligned}
$$
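After integration, the first term is the second moment $\Sigma_1 + \mu_1 \mu_1^T$, the second and third terms involve the mean $\mu_1$, and the fourth term is a constant: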
$$
\begin{aligned}
&\frac{1}{2} \mathrm{tr} \Bigg[ \Sigma_2^{-1} \int_{x_1} \cdots \int_{x_K} \frac{1}{(2\pi)^\frac{K}{2} |\Sigma_1|^{\frac{1}{2}}} e^{-\frac{1}{2}(x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)} (x x^T - \mu_2 x^T - x \mu_2^T + \mu_2 \mu_2^T ) dx_1 \cdots dx_K \Bigg] \\
&= \frac{1}{2} \mathrm{tr} [ \Sigma_2^{-1} (\Sigma_1 + \mu_1 \mu_1^T - \mu_2 \mu_1^T - \mu_1 \mu_2^T + \mu_2 \mu_2^T)] \\
&= \frac{1}{2} \big[ \mathrm{tr} ( \Sigma_2^{-1} \Sigma_1 ) + \mathrm{tr}( \Sigma_2^{-1} (\mu_1 - \mu_2) (\mu_1 - \mu_2)^T ) \big] \\
&= \frac{1}{2} \big[ \mathrm{tr} ( \Sigma_2^{-1} \Sigma_1 ) + (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) \big]
\end{aligned}
$$
Collecting the three terms, the KL divergence between two multivariate Gaussians is:
$$
KL(\mathcal{N}(x | \mu_1, \Sigma_1) || \mathcal{N}(x | \mu_2, \Sigma_2)) = \frac{1}{2} \Bigg[ \log \frac{|\Sigma_2|}{|\Sigma_1|} - K + \mathrm{tr} ( \Sigma_2^{-1} \Sigma_1 ) + (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) \Bigg]
$$
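One possible way to evaluate the multivariate formula is sketched below with NumPy. The function name `kl_gauss_mv` and the sample inputs are assumptions made for this example, and the code presumes both covariance matrices are symmetric positive definite; it does not validate that. It uses `np.linalg.slogdet` for the log-determinants and `np.linalg.solve` rather than an explicit inverse, which is a design choice for numerical stability and not something the derivation prescribes.

```python
import numpy as np


def kl_gauss_mv(mu1, cov1, mu2, cov2):
    """Closed-form KL(N(mu1, cov1) || N(mu2, cov2)) in nats.

    Assumes cov1 and cov2 are symmetric positive definite (not checked).
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    cov1 = np.asarray(cov1, dtype=float)
    cov2 = np.asarray(cov2, dtype=float)
    k = mu1.shape[0]
    diff = mu1 - mu2
    # slogdet returns (sign, log|det|); the sign is +1 for positive-definite input.
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    # tr(cov2^{-1} cov1), computed by solving cov2 X = cov1.
    trace_term = np.trace(np.linalg.solve(cov2, cov1))
    # (mu1 - mu2)^T cov2^{-1} (mu1 - mu2)
    maha_term = diff @ np.linalg.solve(cov2, diff)
    return 0.5 * (logdet2 - logdet1 - k + trace_term + maha_term)


# Illustrative values only: two 2-D Gaussians with diagonal covariances.
print(kl_gauss_mv([0.0, 1.0], np.diag([1.0, 4.0]),
                  [1.0, 1.0], np.diag([4.0, 1.0])))
```

With diagonal covariances the result should equal the sum of the per-dimension univariate divergences, and with $K = 1$ it reduces to the univariate formula from section 1.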