1. We define (inline formula, `$...$`): $f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t$.
2. Define $f(x)$ as follows (display formula, `$$ ... $$`):

$$
f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t \tag{1}
$$
$$
\begin{aligned}
KL(p||q) &= \sum p(x) \log \frac{p(x)}{q(x)} \\
KL(p||q) &= \int p(x) \log \frac{p(x)}{q(x)}\,dx
\end{aligned}
$$
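As a quick numerical illustration (not from the original text), the discrete form can be checked with a few lines of NumPy; the distributions `p` and `q` below are arbitrary example values.

```python
import numpy as np

# Two arbitrary discrete distributions over four outcomes (illustrative values only).
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Discrete form: KL(p || q) = sum_x p(x) * log(p(x) / q(x))
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
print(kl_pq, kl_qp)  # both >= 0, and in general KL(p||q) != KL(q||p)
```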
$$
\begin{aligned}
KL(q(z)||p(z|X)) &= \int q(z) \log \frac{q(z)}{p(z|X)}\,dz\\
&=\int q(z)[\log q(z)-\log p(z|X)]\,dz \\
&=\int q(z)[\log q(z)-\log p(X|z)-\log p(z) + \log p(X)]\,dz\\
&=\int q(z)[\log q(z) -\log p(X|z) - \log p(z)]\,dz+\log p(X)
\end{aligned}
$$

The third equality uses Bayes' rule, $p(z|X) = p(X|z)\,p(z)/p(X)$.
$$
\log p(X) - KL(q(z)||p(z|X)) = \int q(z)\log p(X|z)\,dz - KL(q(z)||p(z))
$$
Although $p(X)$ is not easy to compute, we do know that once $X$ is given, $p(X)$ is a fixed value. Therefore, making $KL(q(z)||p(z|X))$ as small as possible is equivalent to making the right-hand side of the equation above as large as possible.
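Equivalently (just restating the step above), since $KL(q(z)||p(z|X)) \ge 0$, the right-hand side is a lower bound on the evidence, commonly called the ELBO:

$$
\log p(X) \;\ge\; \underbrace{\int q(z)\log p(X|z)\,dz - KL(q(z)||p(z))}_{\text{ELBO}},
$$

with equality exactly when $q(z) = p(z|X)$; maximizing this bound over $q$ is therefore the same as minimizing $KL(q(z)||p(z|X))$.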
Given the generative model
$$
p(A|Z)=\prod_{i=1}^{N}\prod_{j=1}^{N}p(A_{ij}|z_i,z_j)\tag{2}
$$

with

$$
p(A_{ij}=1|z_i,z_j)=\sigma(z_i^\top z_j)\tag{3}
$$

where $\sigma(\cdot)$ is the logistic sigmoid function.
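As a minimal NumPy sketch of the inner-product decoder in Eqs. (2)-(3) (the latent matrix `Z` below is random dummy data standing in for an encoder output, not part of the original text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N, D = 5, 16                      # 5 nodes, 16-dimensional latent vectors (arbitrary)
rng = np.random.default_rng(0)
Z = rng.normal(size=(N, D))       # stand-in for samples z_i from the encoder

# Eq. (3) for all pairs at once: p(A_ij = 1 | z_i, z_j) = sigma(z_i^T z_j)
A_prob = sigmoid(Z @ Z.T)         # N x N matrix of edge probabilities
```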
Learning: We optimize the variational lower bound $\mathcal L$ w.r.t. the variational parameters $W_i$:
$$
\mathcal L = \mathbb E_{q(Z|X,A)}[\log p(A|Z)] - KL[q(Z|X,A)\,||\,p(Z)]\tag{4}
$$
where $KL[q(\cdot)||p(\cdot)]$ is the Kullback-Leibler divergence between $q(\cdot)$ and $p(\cdot)$. We further take a Gaussian prior $p(Z)=\prod_i p(z_i)=\prod_i\mathcal N(z_i|0,1)$. For very sparse $A$, it can be beneficial to re-weight terms with $A_{ij}=1$ in $\mathcal L$, or alternatively to sub-sample terms with $A_{ij}=0$. We choose the former for the following experiments. We perform full-batch gradient descent and make use of the reparameterization trick for training. For a featureless approach, we simply drop the dependence on $X$ and replace $X$ with the identity matrix in the GCN.
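A hedged sketch of the re-weighting idea mentioned above, using a positive-class weight equal to the ratio of non-edges to edges in the binary cross-entropy; the dense `A`, `A_prob`, and the specific `pos_weight` choice are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def reweighted_recon_log_lik(A, A_prob, eps=1e-10):
    """Re-weighted estimate of E_q[log p(A|Z)] for a sparse adjacency matrix A.

    Terms with A_ij = 1 are up-weighted by (#zeros / #ones), one common way
    to counteract the class imbalance of a very sparse graph.
    """
    n_ones = A.sum()
    pos_weight = (A.size - n_ones) / n_ones
    log_lik = (pos_weight * A * np.log(A_prob + eps)
               + (1 - A) * np.log(1 - A_prob + eps))
    return log_lik.mean()
```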
The second term on the right-hand side of Eq. (4):
$$
\begin{aligned}
\int q_{\theta}(z) \log p(z)\,dz &= \int \mathcal N(z;\mu,\sigma^2) \log \mathcal N(z;0,1)\,dz\\
&=-\frac{J}{2} \log (2\pi)-\frac{1}{2}\sum_{j=1}^{J}(\mu_j^2+\sigma_j^2)
\end{aligned}
$$
And
$$
\begin{aligned}
\int q_{\theta}(z) \log q_\theta(z)\,dz &= \int \mathcal N(z;\mu,\sigma^2) \log \mathcal N(z;\mu,\sigma^2)\,dz\\
&=-\frac{J}{2} \log (2\pi)-\frac{1}{2}\sum_{j=1}^{J}(1+\log \sigma_j^2)
\end{aligned}
$$
Therefore:
$$
\begin{aligned}
-D_{KL}(q_\phi(z)\,||\,p_\theta(z))&=\int q_\theta(z)(\log p_{\theta}(z)-\log q_\theta(z))\,dz\\
&=\frac{1}{2}\sum_{j=1}^J\left(1+\log(\sigma_j^2)-\mu_j^2-\sigma_j^2\right)
\end{aligned}
$$
Here, the variational lower bound (the objective to be maximized) contains a KL term that can often be integrated analytically. We give the solution for the case where both the prior $p_\theta(z) = \mathcal N(0, I)$ and the posterior approximation $q_\phi(z|x^{(i)})$ are Gaussian. Let $J$ be the dimensionality of $z$. Let $\mu$ and $\sigma$ denote the variational mean and s.d. evaluated at datapoint $i$, and let $\mu_j$ and $\sigma_j$ simply denote the $j$-th element of these vectors.
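As a numerical sanity check (not part of the original derivation), the closed-form expression can be compared against a Monte Carlo estimate of $\int q_\theta(z)(\log p(z)-\log q_\theta(z))\,dz$; the values of `mu` and `log_var` below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
J = 4
mu = rng.normal(size=J)              # variational means (arbitrary example)
log_var = rng.normal(size=J)         # log sigma_j^2
sigma = np.exp(0.5 * log_var)

# Closed form: -KL = 1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
neg_kl_analytic = 0.5 * np.sum(1 + log_var - mu**2 - sigma**2)

# Monte Carlo check with z ~ N(mu, diag(sigma^2))
S = 200_000
z = mu + sigma * rng.normal(size=(S, J))
log_p = -0.5 * np.sum(np.log(2 * np.pi) + z**2, axis=1)
log_q = -0.5 * np.sum(np.log(2 * np.pi) + log_var + (z - mu)**2 / sigma**2, axis=1)
neg_kl_mc = np.mean(log_p - log_q)   # should match neg_kl_analytic up to MC noise
```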
The first term on the right-hand side of Eq. (4):
$$
\begin{aligned}
\mathbb E_{q(Z|X,A)}[\log p(A|Z)] &= \int q(Z|X,A) \log p(A|Z)\,dZ \\
&= \int \prod_{i=1}^N q(z_i|X,A) \log p(A|Z)\,dZ \\
&=\int \prod_{i=1}^N q(z_i|X,A) \sum_{i=1}^N \sum_{j=1}^N \log p(A_{ij}|z_i,z_j)\,dZ\\
&=\int \prod_{i=1}^N \mathcal N(z_i|\mu_i,\mathrm{diag}(\sigma_i^2)) \sum_{i=1}^N \sum_{j=1}^N \log \sigma(z_i^\top z_j)\,dZ\\
&=\int \prod_{i=1}^N \mathcal N(\eta_i|0,1)\,f(\eta_i;W)\,d\eta_i\\
&\approx \frac{1}{S} \sum_{s=1}^{S} f(\eta_i^{(s)};W)
\end{aligned}
$$
where
$$
\eta_i=(z_i-\mu_i(W))/\sigma_i(W)\sim \mathcal N(\eta|0,1)
$$
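A sketch of this Monte Carlo step under the reparameterization: draw $\eta^{(s)}\sim\mathcal N(0,1)$, form $z_i = \mu_i + \sigma_i \odot \eta_i^{(s)}$, and average the Bernoulli log-likelihood of $A$; `mu`, `sigma`, `A`, and the helper below are illustrative placeholders rather than the paper's actual code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mc_recon_term(mu, sigma, A, S=1, eps=1e-10, seed=0):
    """Monte Carlo estimate of E_q[log p(A|Z)] via reparameterized samples."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(S):
        eta = rng.normal(size=mu.shape)   # eta ~ N(0, 1)
        Z = mu + sigma * eta              # z_i = mu_i + sigma_i * eta_i
        A_prob = sigmoid(Z @ Z.T)         # sigma(z_i^T z_j) for all pairs
        total += np.sum(A * np.log(A_prob + eps)
                        + (1 - A) * np.log(1 - A_prob + eps))
    return total / S
```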
$$
\begin{aligned}
\mathbb{E}_{q(\mathbf{Z}\mid \mathbf{X},\mathbf{A})}\left[\log p(\mathbf{y},\mathbf{A}\mid \mathbf{Z})\right] &=\mathbb{E}_{q(\mathbf{Z}\mid \mathbf{X},\mathbf{A})}\left[\log [p(\mathbf{y}\mid \mathbf{Z})\,p(\mathbf{A} \mid \mathbf{Z})]\right]\\
&=\mathbb E_{q(\mathbf Z|\mathbf X,\mathbf A)}[\log p(\mathbf y|\mathbf Z)]+\mathbb E_{q(\mathbf Z|\mathbf X,\mathbf A)}[\log p(\mathbf A \mid \mathbf Z)]\\
\mathbb E_{q(Z|X,A)}[\log p(A|Z)] &\approx \frac{1}{L} \sum_{l=1}^{L}\log p_\theta(x^{(i)}|z^{(i,l)})
\end{aligned}
$$
Proof of the reparameterization trick
We invoke an alternative method for generating samples from $q_\phi(z|x)$, whose basic parameterization is quite simple. Let $z$ be a continuous random variable following the conditional distribution $z \sim q_\phi(z|x)$. It is often possible to express this random variable as a deterministic variable $z = g_\phi(\epsilon,x)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$.
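For the Gaussian case used throughout this post, a minimal sketch of $g_\phi$ (the location-scale form below is a standard choice, assumed here rather than stated in the text):

```python
import numpy as np

def g_phi(eps, mu, sigma):
    # Deterministic transform z = mu + sigma * eps with eps ~ p(eps) = N(0, 1):
    # all randomness lives in eps, so z stays differentiable w.r.t. mu and sigma.
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, sigma = np.array([0.5, -1.0]), np.array([1.2, 0.3])   # arbitrary example values
eps = rng.normal(size=2)          # auxiliary noise, independent of the parameters
z = g_phi(eps, mu, sigma)         # one sample from q_phi(z|x) = N(mu, sigma^2)
```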