Everyone should be familiar with Bayes' rule:
$$P(Z|X)=\frac{p(X,Z)}{\int_z p(X,Z=z)\,dz}$$
We call $P(Z|X)$ the posterior distribution. Computing the posterior distribution is usually very difficult. Why?
Suppose $Z$ is a high-dimensional random variable. To compute $P(Z=z|X=x)$ we cannot avoid evaluating $\int_z p(X=x,Z=z)\,dz$, and because $Z$ is high-dimensional this integral is very hard to compute.
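To see why the normalizing integral is the bottleneck, here is a small illustrative sketch (the Gaussian joint density below is a made-up stand-in, not a model from this post): naive grid quadrature over $z$ needs $M^d$ evaluations of $p(X=x,Z=z)$, so the cost explodes as the dimension $d$ of $Z$ grows.

```python
import numpy as np

# Hypothetical joint density p(x, z) used only for illustration:
# z ~ N(0, I_d) and x | z ~ N(sum(z), 1), with a scalar observation x.
def log_joint(x, z):
    log_prior = -0.5 * np.sum(z ** 2, axis=-1) - 0.5 * z.shape[-1] * np.log(2 * np.pi)
    log_lik = -0.5 * (x - np.sum(z, axis=-1)) ** 2 - 0.5 * np.log(2 * np.pi)
    return log_prior + log_lik

x_obs = 1.0
M = 20  # grid points per dimension of z

for d in [1, 2, 3, 4]:
    axes = [np.linspace(-5.0, 5.0, M)] * d
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)
    cell_volume = (10.0 / (M - 1)) ** d
    evidence = np.sum(np.exp(log_joint(x_obs, grid))) * cell_volume
    # The number of joint-density evaluations grows as M**d.
    print(f"d={d}: {M ** d:>8d} evaluations, p(x) ~= {evidence:.4f}")
```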
Variational inference is a method for approximating this posterior distribution. Its core idea consists of two steps: first, posit a family of simple distributions $q(z;\lambda)$ indexed by variational parameters $\lambda$; second, find the member of that family that is closest to $p(z|x)$.
Summed up in one sentence: use a simple distribution $q(z;\lambda)$ to approximate the complicated distribution $p(z|x)$.
This strategy turns the problem of computing $p(z|x)$ into an optimization problem:
$$\lambda^* = \arg\min_{\lambda}\ \mathrm{divergence}\big(p(z|x),\,q(z;\lambda)\big)$$
Once the optimization converges, $q(z;\lambda^*)$ can be used in place of $p(z|x)$.
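To make the optimization view concrete, here is a minimal 1-D sketch (the target density and the choice of a Gaussian family are purely illustrative assumptions, not part of the original derivation): the posterior is tabulated on a grid, and `scipy.optimize.minimize` searches over $\lambda=(m,\log s)$ to minimize $KL\big(q(z;\lambda)\,\|\,p(z|x)\big)$.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative 1-D target: an unnormalized posterior p(z|x) proportional to p(x, z).
def log_unnormalized_posterior(z):
    return -0.5 * (z - 1.0) ** 2 + 0.3 * np.tanh(2.0 * z)

# Normalize it numerically on a grid so KL(q || p) can be evaluated.
grid = np.linspace(-8.0, 8.0, 4001)
dz = grid[1] - grid[0]
log_p = log_unnormalized_posterior(grid)
log_p -= np.log(np.sum(np.exp(log_p)) * dz)          # log p(z|x) on the grid

# q(z; lambda) is a Gaussian with lambda = (m, log s).
def log_q(z, lam):
    m, log_s = lam
    s = np.exp(log_s)
    return -0.5 * ((z - m) / s) ** 2 - log_s - 0.5 * np.log(2 * np.pi)

def kl_q_p(lam):
    lq = log_q(grid, lam)
    return np.sum(np.exp(lq) * (lq - log_p)) * dz     # KL(q(z;lambda) || p(z|x))

result = minimize(kl_q_p, x0=np.array([0.0, 0.0]))
m_opt, s_opt = result.x[0], np.exp(result.x[1])
print(f"fitted q(z; lambda*): mean={m_opt:.3f}, std={s_opt:.3f}, KL={result.fun:.5f}")
```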
Start by taking the logarithm of the evidence $P(x)$:
$$\begin{aligned} \log P(x) &= \log P(x,z)-\log P(z|x) \\ &=\log\frac{P(x,z)}{Q(z;\lambda)}-\log\frac{P(z|x)}{Q(z;\lambda)} \end{aligned}$$
Taking the expectation of both sides with respect to the distribution $Q(z;\lambda)$ gives
$$\begin{aligned} \mathbb E_{q(z;\lambda)}\log P(x) &= \mathbb E_{q(z;\lambda)}\log P(x,z)-\mathbb E_{q(z;\lambda)}\log P(z|x) \\ \log P(x)&=\mathbb E_{q(z;\lambda)}\log\frac{p(x,z)}{q(z;\lambda)}-\mathbb E_{q(z;\lambda)}\log\frac{p(z|x)}{q(z;\lambda)} \\ &=KL\big(q(z;\lambda)\,\|\,p(z|x)\big)+\mathbb E_{q(z;\lambda)}\log\frac{p(x,z)}{q(z;\lambda)} \end{aligned}$$
Our goal is to make $q(z;\lambda)$ close to $p(z|x)$, i.e. to solve $\min_\lambda KL\big(q(z;\lambda)\,\|\,p(z|x)\big)$. But $KL\big(q(z;\lambda)\,\|\,p(z|x)\big)$ contains $p(z|x)$, the very quantity we cannot compute. Viewing $\lambda$ as the variable, $\log P(x)$ is a constant, so $\min_\lambda KL\big(q(z;\lambda)\,\|\,p(z|x)\big)$ is equivalent to $\max_\lambda \mathbb E_{q(z;\lambda)}\log\frac{p(x,z)}{q(z;\lambda)}$. The quantity $\mathbb E_{q(z;\lambda)}[\log p(x,z)-\log q(z;\lambda)]$ is called the Evidence Lower Bound (ELBO).
The objective of variational inference therefore becomes
$$\max_\lambda \mathbb E_{q(z;\lambda)}[\log p(x,z)-\log q(z;\lambda)]$$
Why is it called the ELBO?
$p(x)$ is usually called the evidence. Since $KL(q\,\|\,p)\ge 0$, we have $\log p(x)\ge \mathbb E_{q(z;\lambda)}[\log p(x,z)-\log q(z;\lambda)]$, i.e. this quantity is a lower bound on the log evidence, which is why it is called the ELBO.
Let us look at the ELBO more closely:
$$\begin{aligned} ELBO(\lambda) &= \mathbb E_{q(z;\lambda)}[\log p(x,z)-\log q(z;\lambda)] \\ &= \mathbb E_{q(z;\lambda)}\log p(x,z) -\mathbb E_{q(z;\lambda)}\log q(z;\lambda)\\ &= \mathbb E_{q(z;\lambda)}\log p(x,z) + H(q) \end{aligned}$$
The first term represents an energy. The energy encourages $q$ to focus probability mass where the model puts high probability, $p(\mathbf{x}, \mathbf{z})$. The entropy encourages $q$ to spread probability mass to avoid concentrating in one location.
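A quick numerical sanity check of this decomposition, under an assumed conjugate toy model (prior $z\sim N(0,1)$, likelihood $x|z\sim N(z,1)$) where $\log p(x)$ has a closed form: the Monte Carlo estimate of energy plus entropy never exceeds $\log p(x)$, and matches it when $q$ is the exact posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.0                                                     # observed data
log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))  # exact log p(x)

def elbo(mu_q, sigma_q, n_samples=200_000):
    z = rng.normal(mu_q, sigma_q, size=n_samples)           # z ~ q(z; lambda)
    log_joint = norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 1.0)  # log p(x, z)
    energy = np.mean(log_joint)                              # E_q[log p(x, z)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * sigma_q ** 2)  # H(q) in closed form
    return energy + entropy

print("log p(x)                   :", log_evidence)
print("ELBO at the exact posterior:", elbo(0.5, np.sqrt(0.5)))  # ~ log p(x)
print("ELBO at a poorer q         :", elbo(2.0, 1.5))           # strictly smaller
```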
Suppose $Z$ consists of $K$ random variables (each of which may itself be multivariate). We assume
$$q(Z;\lambda) = \prod_{k=1}^{K}q_k(Z_k;\lambda_k)$$
This is called the mean field approximation. For more on the mean field approximation, see https://metacademy.org/graphs/concepts/mean_field.
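As a small illustration of what the factorization buys us (the three Gaussian factors below are arbitrary stand-ins): under mean field, $\log q(Z;\lambda)=\sum_k \log q_k(Z_k;\lambda_k)$ and the entropy also decomposes as $H(q)=\sum_k H(q_k)$.

```python
import numpy as np
from scipy.stats import norm

# Mean-field q(Z; lambda) over K = 3 scalar latents: each factor q_k is an
# independent Gaussian with its own variational parameters lambda_k = (m_k, s_k).
means = np.array([0.0, 2.0, -1.0])
stds = np.array([1.0, 0.5, 2.0])

z = np.array([0.3, 1.8, -0.5])   # one configuration of Z = (Z_1, Z_2, Z_3)

# The joint variational log-density is a sum over factors ...
log_q_joint = np.sum(norm.logpdf(z, means, stds))
# ... and the entropy decomposes the same way: H(q) = sum_k H(q_k).
entropy = np.sum(0.5 * np.log(2 * np.pi * np.e * stds ** 2))

print("log q(Z; lambda) =", log_q_joint)
print("H(q)             =", entropy)
```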
The ELBO then becomes
$$\begin{aligned} ELBO(\lambda) &= \mathbb E_{q(Z;\lambda)}\log p(X,Z) -\mathbb E_{q(Z;\lambda)}\log q(Z;\lambda) \\ &= \int q(Z;\lambda)\log p(X,Z)\,dZ-\int q(Z;\lambda)\log q(Z;\lambda)\,dZ\\ &=\int \Big[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Big] \log p(X,Z)\,dZ-\int \Big[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Big] \log q(Z;\lambda)\,dZ \end{aligned}$$
The first term is the energy and the second is $H(q)$.
Notation:
$$Z = \{Z_j,\overline Z_j \},\qquad \overline Z_j=Z\backslash Z_j$$
$$\lambda=\{\lambda_j, \overline\lambda_j\},\qquad \overline \lambda_j=\lambda\backslash\lambda_j$$
First, handle the first term (the energy):
$$\begin{aligned} \int \Big[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Big] \log p(X,Z)\,dZ &= \int_{Z_j}q_j(Z_j;\lambda_j)\int_{\overline Z_j}\Big[\prod_{k \neq j} q_k(Z_k;\lambda_k)\Big]\log p(X,Z)\,d\overline Z_j\,dZ_j \\ &= \int_{Z_j}q_j(Z_j;\lambda_j)\Big[\mathbb E_{q(\overline Z_j;\overline \lambda_j)}\log p(X,Z)\Big]dZ_j\\ &= \int_{Z_j}q_j(Z_j;\lambda_j)\log \exp\Big[\mathbb E_{q(\overline Z_j;\overline \lambda_j)}\log p(X,Z)\Big]dZ_j\\ &= \int_{Z_j}q_j(Z_j;\lambda_j)\Big[\log q_j^*(Z_j;\lambda_j)+\log C\Big]dZ_j \end{aligned}$$
where $q_j^*(Z_j;\lambda_j)=\frac{1}{C}\exp\big[\mathbb E_{q(\overline Z_j;\overline \lambda_j)}\log p(X,Z)\big]$ and $C$ is the constant that normalizes $q_j^*(Z_j;\lambda_j)$ into a proper distribution. Note that $C$ depends on the variational parameters $\overline \lambda_j$ but not on $Z$ or $\lambda_j$.
Next, handle the second term (the negative entropy):
$$\begin{aligned} \int \Big[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Big] \log q(Z;\lambda)\,dZ &= \int \Big[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Big] \sum_{n=1}^K\log q_n(Z_n;\lambda_n)\,dZ \\ &= \sum_j\int \Big[\prod_{k=1}^{K}q_k(Z_k;\lambda_k)\Big] \log q_j(Z_j;\lambda_j)\,dZ\\ &= \sum_j\int_{Z_j} q_j(Z_j;\lambda_j)\log q_j(Z_j;\lambda_j)\,dZ_j\int \Big[\prod_{k\neq j}q_k(Z_k;\lambda_k)\Big]d\overline Z_j\\ &= \sum_j\int_{Z_j} q_j(Z_j;\lambda_j)\log q_j(Z_j;\lambda_j)\,dZ_j \end{aligned}$$
After these manipulations, singling out one factor $Z_i$, the ELBO becomes
$$\begin{aligned} ELBO &= \int_{Z_i}q_i(Z_i;\lambda_i)\log q_i^*(Z_i;\lambda_i)\,dZ_i-\sum_j\int_{Z_j} q_j(Z_j;\lambda_j)\log q_j(Z_j;\lambda_j)\,dZ_j+\log C\\ &=\Big\{\int_{Z_i}q_i(Z_i;\lambda_i)\log q_i^*(Z_i;\lambda_i)\,dZ_i-\int_{Z_i} q_i(Z_i;\lambda_i)\log q_i(Z_i;\lambda_i)\,dZ_i\Big\} +H\big(q(\overline Z_i;\overline \lambda_i)\big)+\log C \end{aligned}$$
Now look at the term in braces $\{\cdot\}$ above:
$$\int_{Z_i}q_i(Z_i;\lambda_i)\log q_i^*(Z_i;\lambda_i)\,dZ_i-\int_{Z_i} q_i(Z_i;\lambda_i)\log q_i(Z_i;\lambda_i)\,dZ_i = -KL\big(q_i(Z_i;\lambda_i)\,\|\,q_i^*(Z_i;\lambda_i)\big)$$
So the ELBO can also be written as:
$$ELBO=-KL\big(q_i(Z_i;\lambda_i)\,\|\,q_i^*(Z_i;\lambda_i)\big)+H\big(q(\overline Z_i;\overline \lambda_i)\big)+\log C$$
We want to maximize the ELBO. How should each $q_i(Z_i;\lambda_i)$ be updated?
From
$$ELBO=-KL\big(q_i(Z_i;\lambda_i)\,\|\,q_i^*(Z_i;\lambda_i)\big)+H\big(q(\overline Z_i;\overline \lambda_i)\big)+\log C$$
we can see that when $q_i(Z_i;\lambda_i)=q_i^*(Z_i;\lambda_i)$, the term $KL\big(q_i(Z_i;\lambda_i)\,\|\,q_i^*(Z_i;\lambda_i)\big)$ vanishes and the ELBO reaches its maximum with respect to $q_i$ (holding the other factors fixed).
So the parameter update strategy becomes the following coordinate-wise scheme (see the sketch after the updates below):
$$\begin{aligned} &q_1(Z_1;\lambda_1)=q_1^*(Z_1;\lambda_1)\\ &q_2(Z_2;\lambda_2)=q_2^*(Z_2;\lambda_2)\\ &q_3(Z_3;\lambda_3)=q_3^*(Z_3;\lambda_3)\\ &\;\;\vdots \end{aligned}$$
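Here is a minimal coordinate-ascent sketch for one concrete conjugate model, the textbook univariate Gaussian with unknown mean $\mu$ and precision $\tau$ (this specific model and its hyperparameters are an assumption added for illustration; they are not part of the derivation above). In this model each optimal factor $q^*$ has a closed form, and the loop below cycles through the updates exactly in the fashion shown above.

```python
import numpy as np

# CAVI sketch for x_i ~ N(mu, 1/tau) with conjugate priors
#   mu | tau ~ N(mu0, 1/(lam0 * tau))   and   tau ~ Gamma(a0, b0),
# under the mean-field assumption q(mu, tau) = q(mu) q(tau), where
#   q(mu) = N(mu_n, 1/lam_n)   and   q(tau) = Gamma(a_n, b_n).
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=200)    # synthetic observations
N, x_bar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0          # prior hyperparameters
e_tau = a0 / b0                                 # initial guess for E_q[tau]

for _ in range(50):
    # Update q(mu) <- q_mu*(mu): its parameters depend on the current E_q[tau].
    mu_n = (lam0 * mu0 + N * x_bar) / (lam0 + N)
    lam_n = (lam0 + N) * e_tau

    # Update q(tau) <- q_tau*(tau): its parameters depend on E_q[mu] and E_q[mu^2].
    e_mu, e_mu2 = mu_n, mu_n ** 2 + 1.0 / lam_n
    a_n = a0 + (N + 1) / 2.0
    b_n = b0 + 0.5 * (np.sum(x ** 2) - 2.0 * e_mu * np.sum(x) + N * e_mu2
                      + lam0 * (e_mu2 - 2.0 * mu0 * e_mu + mu0 ** 2))
    e_tau = a_n / b_n

print(f"q(mu):  mean = {mu_n:.3f},  var = {1.0 / lam_n:.4f}")
print(f"q(tau): E[tau] = {e_tau:.2f}  (true precision = {1.0 / 0.5 ** 2:.1f})")
```

Each pass of the loop is one sweep of the update list above: refresh $q(\mu)$ with $q(\tau)$ held fixed, then refresh $q(\tau)$ with $q(\mu)$ held fixed; the ELBO is non-decreasing across such sweeps.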
Writing out $q_i^*(Z_i;\lambda_i)$, the update for $q_i$ is
$$\begin{aligned} q_i(Z_i;\lambda_i)&=q_i^*(Z_i;\lambda_i)\\ &=\frac{1}{C}\exp\Big[\mathbb E_{q(\overline Z_i;\overline \lambda_i)}\log p(X,Z)\Big]\\ &=\frac{1}{C}\exp\Big[\mathbb E_{q(\overline Z_i;\overline \lambda_i)}\log p(X,Z_i,\overline Z_i)\Big] \end{aligned}$$
Here $q_i$ is the factor (node) being updated and $X$ is the observed data. Thanks to the Markov blanket property (introduced below), the update formula becomes:
$$\log q_i(Z_i;\lambda_i)=\int q\big(mb(Z_i)\big)\log p\big(Z_i,mb(Z_i),X\big)\,d\,mb(Z_i)$$
Because every term unrelated to $Z_i$ is integrated out (and absorbed into the normalizing constant), the update can be written purely in terms of the Markov blanket of $Z_i$.
In machine learning, the Markov blanket for a node $A$ in a Bayesian network is the set of nodes $mb(A)$ composed of $A$'s parents, its children, and its children's other parents. In a Markov random field, the Markov blanket of a node is its set of neighboring nodes.
Every set of nodes in the network is conditionally independent of $A$ when conditioned on the set $mb(A)$, that is, when conditioned on the Markov blanket of the node $A$. The probability has the Markov property; formally, for distinct nodes $A$ and $B$:
$$\Pr\big(A \mid mb(A), B\big)=\Pr\big(A \mid mb(A)\big)$$
The Markov blanket of a node contains all the variables that shield the node from the rest of the network. This means that the Markov blanket of a node is the only knowledge needed to predict the behavior of that node.
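As a concrete illustration of this definition, a tiny helper that computes the Markov blanket of a node in a Bayesian network given as parent lists (the network and node names below are hypothetical):

```python
# Markov blanket of a node in a Bayesian network represented as a dict of
# parent lists: mb(A) = parents(A) | children(A) | other parents of A's children.
def markov_blanket(node, parents):
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]} - {node}
    return set(parents.get(node, [])) | children | co_parents

# A small hypothetical network; edges point from parent to child.
parents = {
    "A": [],
    "B": [],
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["D", "F"],
    "F": [],
}
print(markov_blanket("C", parents))  # {'A', 'B', 'D'}
print(markov_blanket("D", parents))  # {'C', 'E', 'F'}
```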
https://en.wikipedia.org/wiki/Markov_blanket
http://edwardlib.org/tutorials/inference
http://edwardlib.org/tutorials/variational-inference