The probability density function of a univariate Gaussian distribution is

$$
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}. \tag{1.1}
$$
Given two probability distributions $p(x)$ and $q(x)$, their KL divergence is defined as

$$
KL(p \parallel q) = \int p(x) \log \frac{p(x)}{q(x)}\,dx. \tag{1.2}
$$
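As a rough numerical illustration of definition (1.2), the sketch below approximates the integral with a midpoint rule for two Gaussian densities of the form (1.1). The function names, the truncation of the integration range to 12 standard deviations around the mean of $p$, and the number of steps are assumptions chosen for illustration only.

```python
import math


def log_gaussian_pdf(x, mu, sigma):
    """Natural log of the Gaussian density in Eq. (1.1)."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))


def kl_numeric(mu1, sigma1, mu2, sigma2, width=12.0, steps=20000):
    """Midpoint-rule approximation of Eq. (1.2) for two Gaussians.

    Integrates over [mu1 - width*sigma1, mu1 + width*sigma1]; this
    truncation is an assumption (the mass of p outside it is negligible).
    """
    lo = mu1 - width * sigma1
    hi = mu1 + width * sigma1
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        log_p = log_gaussian_pdf(x, mu1, sigma1)
        log_q = log_gaussian_pdf(x, mu2, sigma2)
        # p(x) * log(p(x) / q(x)), computed from log-densities for stability
        total += math.exp(log_p) * (log_p - log_q)
    return total * h


print(kl_numeric(0.0, 1.0, 0.0, 1.0))  # identical distributions
print(kl_numeric(0.0, 1.0, 1.0, 2.0))  # different mean and variance
```

Working with log-densities avoids evaluating $\log(0)$ when a density underflows in the tails.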
Then, for two univariate Gaussian distributions $X_1 \sim p(x) = \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim q(x) = \mathcal{N}(\mu_2, \sigma_2^2)$, the KL divergence is

$$
\begin{aligned}
KL(p \parallel q) &= \int p(x) \log \frac{p(x)}{q(x)}\,dx \\
&= \int p(x) \log \left[ \frac{\sigma_2}{\sigma_1} e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2} + \frac{(x - \mu_2)^2}{2\sigma_2^2}} \right] dx \\
&= \int p(x) \log \frac{\sigma_2}{\sigma_1}\,dx + \int p(x) \left[ -\frac{(x - \mu_1)^2}{2\sigma_1^2} + \frac{(x - \mu_2)^2}{2\sigma_2^2} \right] dx \\
&= \frac{\sigma_1^2 - \sigma_2^2}{2\sigma_1^2\sigma_2^2} \int p(x)\,x^2\,dx + \frac{\mu_1\sigma_2^2 - \mu_2\sigma_1^2}{\sigma_1^2\sigma_2^2} \int p(x)\,x\,dx \\
&\quad + \left( \frac{\mu_2^2\sigma_1^2 - \mu_1^2\sigma_2^2}{2\sigma_1^2\sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} \right) \int p(x)\,dx.
\end{aligned} \tag{1.3}
$$
The expression above consists of three terms. For the second term, the integral is $E(X_1) = \mu_1$, so

$$
\frac{\mu_1\sigma_2^2 - \mu_2\sigma_1^2}{\sigma_1^2\sigma_2^2} \int p(x)\,x\,dx = \frac{\mu_1\sigma_2^2 - \mu_2\sigma_1^2}{\sigma_1^2\sigma_2^2} E(X_1) = \frac{\mu_1^2\sigma_2^2 - \mu_1\mu_2\sigma_1^2}{\sigma_1^2\sigma_2^2}. \tag{1.4}
$$
For the third term, the integral is $E(1)$, which always equals 1 because $p(x)$ is normalized, so

$$
\left( \frac{\mu_2^2\sigma_1^2 - \mu_1^2\sigma_2^2}{2\sigma_1^2\sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} \right) \int p(x)\,dx = \frac{\mu_2^2\sigma_1^2 - \mu_1^2\sigma_2^2}{2\sigma_1^2\sigma_2^2} + \log \frac{\sigma_2}{\sigma_1}. \tag{1.5}
$$
For the first term, the integral is $E(X_1^2)$, which follows from the relation between variance and mean,

$$
D(X) = E(X^2) - E^2(X). \tag{1.6}
$$
Hence

$$
\frac{\sigma_1^2 - \sigma_2^2}{2\sigma_1^2\sigma_2^2} \int p(x)\,x^2\,dx = \frac{\sigma_1^2 - \sigma_2^2}{2\sigma_1^2\sigma_2^2} \left[ D(X_1) + E^2(X_1) \right] = \frac{\sigma_1^2 - \sigma_2^2}{2\sigma_1^2\sigma_2^2} \left( \sigma_1^2 + \mu_1^2 \right). \tag{1.7}
$$
Combining the three terms and putting them over the common denominator $2\sigma_1^2\sigma_2^2$ gives

$$
\begin{aligned}
KL(p \parallel q) &= \frac{\sigma_1^2 - \sigma_2^2}{2\sigma_1^2\sigma_2^2} \left( \sigma_1^2 + \mu_1^2 \right) + \frac{\mu_1^2\sigma_2^2 - \mu_1\mu_2\sigma_1^2}{\sigma_1^2\sigma_2^2} + \frac{\mu_2^2\sigma_1^2 - \mu_1^2\sigma_2^2}{2\sigma_1^2\sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} \\
&= \frac{\sigma_1^2 \left[ (\mu_1 - \mu_2)^2 + \sigma_1^2 - \sigma_2^2 \right]}{2\sigma_1^2\sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} \\
&= \frac{(\mu_1 - \mu_2)^2 + \left( \sigma_1^2 - \sigma_2^2 \right)}{2\sigma_2^2} + \log \frac{\sigma_2}{\sigma_1}.
\end{aligned} \tag{1.8}
$$
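The closed form (1.8) translates directly into code. The following is a minimal sketch; the function name `kl_gaussian` and the convention of passing standard deviations rather than variances are assumptions made for illustration.

```python
import math


def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(p || q) from Eq. (1.8).

    p = N(mu1, sigma1^2), q = N(mu2, sigma2^2); sigma1 and sigma2 are
    standard deviations and must be positive.
    """
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("standard deviations must be positive")
    quadratic = ((mu1 - mu2) ** 2 + sigma1 ** 2 - sigma2 ** 2) / (2.0 * sigma2 ** 2)
    return quadratic + math.log(sigma2 / sigma1)


print(kl_gaussian(0.0, 1.0, 0.0, 1.0))  # identical distributions
print(kl_gaussian(0.0, 1.0, 1.0, 2.0))  # equals log(2) - 0.25 by Eq. (1.8)
print(kl_gaussian(1.0, 2.0, 0.0, 1.0))  # p and q swapped
```

The last two calls differ only in the order of the arguments and return different values, which reflects the fact that the KL divergence is not symmetric in $p$ and $q$.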
When $X_2 \sim q(x) = \mathcal{N}(0, 1)$, i.e. $q$ is the standard normal distribution, this simplifies to

$$
\begin{aligned}
KL(p \parallel q) &= -\log \sigma_1 + \frac{\mu_1^2 + \sigma_1^2 - 1}{2} \\
&= -\frac{1}{2} \left( 1 + \log \sigma_1^2 - \mu_1^2 - \sigma_1^2 \right).
\end{aligned} \tag{1.9}
$$
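As a quick sanity check, the two lines of (1.9) can be coded separately and compared. This is only a sketch; both function names and the sample values are hypothetical.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)), second line of Eq. (1.9)."""
    return -0.5 * (1.0 + math.log(sigma ** 2) - mu ** 2 - sigma ** 2)


def kl_to_standard_normal_alt(mu, sigma):
    """The same quantity, written as the first line of Eq. (1.9)."""
    return -math.log(sigma) + (mu ** 2 + sigma ** 2 - 1.0) / 2.0


# Arbitrary sample parameters; both forms should agree up to rounding.
mu, sigma = 0.5, 1.5
print(kl_to_standard_normal(mu, sigma))
print(kl_to_standard_normal_alt(mu, sigma))
```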
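Thus the KL divergence between Gaussian distributions has a very concise closed form, and no actual integration is required; this is one reason Gaussian distributions are so widely used in probabilistic analysis. For multivariate Gaussian distributions, the components are commonly assumed to be independent, i.e. the covariance matrices of both distributions are diagonal. In that case the KL divergence is simply the sum of the KL divergences of the individual components.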
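For the diagonal-covariance case, the sum over components might be sketched as follows. The function name and the choice of plain Python sequences of per-component means and standard deviations are assumptions for illustration; each summand is the univariate closed form (1.8).

```python
import math


def kl_diag_gaussian(mu1, sigma1, mu2, sigma2):
    """KL between two Gaussians with diagonal covariance matrices.

    Each argument is a sequence of per-component means or standard
    deviations; the result is the sum of Eq. (1.8) over the components.
    """
    if not (len(mu1) == len(sigma1) == len(mu2) == len(sigma2)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for m1, s1, m2, s2 in zip(mu1, sigma1, mu2, sigma2):
        # Univariate closed form, Eq. (1.8), for one component
        total += ((m1 - m2) ** 2 + s1 ** 2 - s2 ** 2) / (2.0 * s2 ** 2) + math.log(s2 / s1)
    return total


# Two components: the first pair is identical, the second is N(1, 4) vs N(0, 1).
print(kl_diag_gaussian([0.0, 1.0], [1.0, 2.0], [0.0, 0.0], [1.0, 1.0]))
```

In this example the first component contributes nothing, so the result equals the univariate divergence of the second component alone.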