Formula Editing

一、 Writing formulas on CSDN:

1. We define (inline formula: `$...$`) $f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t$.

2. We define $f(x)$ as follows (display formula: `$$ ... $$`):

$$f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t\tag{1}$$
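
For reference, here is a minimal example of the raw source behind the two styles above; `\tag{1}` is what produces the equation number:

```latex
% Inline formula: wrap the LaTeX source in single dollar signs.
$f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t$

% Display (block) formula: wrap it in double dollar signs; \tag{1} numbers it.
$$
f(x) = \sum_{i=0}^{N}\int_{a}^{b} g(t,i) \text{ d}t \tag{1}
$$
```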

二、KL divergence

$$\begin{aligned} KL(p\,||\,q)&=\sum p(x) \log \frac{p(x)}{q(x)} \\ KL(p\,||\,q) &= \int p(x) \log \frac{p(x)}{q(x)}\,dx \end{aligned}$$
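
As a quick sanity check of the discrete form, here is a minimal NumPy sketch (the two distributions below are made-up examples):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p||q) = sum_x p(x) * log(p(x) / q(x))."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing by convention
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.4, 0.4, 0.2]
q = [0.3, 0.5, 0.2]
print(kl_divergence(p, q))  # small positive number
print(kl_divergence(p, p))  # 0.0: KL vanishes only when p == q
```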

三、Variational Graph Auto-Encoders

$$\begin{aligned} KL(q(z)\,||\,p(z|X)) &= \int q(z) \log \frac{q(z)}{p(z|X)}\,dz\\ &=\int q(z)[\log q(z)-\log p(z|X)]\,dz \\ &=\int q(z)[\log q(z)-\log p(X|z)-\log p(z)+\log p(X)]\,dz\\ &=\int q(z)[\log q(z)-\log p(X|z)-\log p(z)]\,dz+\log p(X) \end{aligned}$$
$$\log p(X) - KL(q(z)\,||\,p(z|X))= \int q(z)\log p(X|z)\,dz - KL(q(z)\,||\,p(z))$$
Although $p(X)$ is not easy to compute, we do know that once $X$ is given, $p(X)$ is a fixed value. So making $KL(q(z)\,||\,p(z|X))$ as small as possible is equivalent to making the right-hand side of the equation above as large as possible.
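
Written out once more with the standard name for that right-hand side (the "evidence lower bound", ELBO, a label not used in the equations above):

$$\log p(X) = \underbrace{\int q(z)\log p(X|z)\,dz - KL(q(z)\,||\,p(z))}_{\text{ELBO}} + KL(q(z)\,||\,p(z\mid X)) \ge \text{ELBO},$$

since $KL(q(z)\,||\,p(z\mid X)) \ge 0$. With $\log p(X)$ fixed, maximizing the ELBO is therefore the same as minimizing $KL(q(z)\,||\,p(z\mid X))$.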

Given the generative model
$$p(A|Z)=\prod_{i=1}^{N}\prod_{j=1}^{N}p(A_{ij}|z_i,z_j)\tag{2}$$

with

$$p(A_{ij}=1|z_i,z_j)=\sigma(z_i^\top z_j)\tag{3}$$
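
A minimal NumPy sketch of this inner-product decoder (Eqs. 2-3); the random `Z` below is only a stand-in for learned latent embeddings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_adjacency(Z):
    """Edge probabilities p(A_ij = 1 | z_i, z_j) = sigmoid(z_i^T z_j) for all pairs."""
    return sigmoid(Z @ Z.T)

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 16))   # 5 nodes, 16-dimensional latent vectors
A_prob = decode_adjacency(Z)   # 5 x 5 matrix of edge probabilities in (0, 1)
print(A_prob.shape)
```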
Learning: We optimize the variational lower bound $\mathcal L$ w.r.t. the variational parameters $W_i$:
$$\mathcal L = \mathbb E_{q(Z|X,A)}[\log p(A|Z)]-KL[q(Z|X,A)\,||\,p(Z)]\tag{4}$$
where $KL[q(\cdot)||p(\cdot)]$ is the Kullback-Leibler divergence between $q(\cdot)$ and $p(\cdot)$. We further take a Gaussian prior $p(Z)=\prod_i p(z_i)=\prod_i\mathcal N(z_i|0,I)$. For very sparse $A$, it can be beneficial to re-weight terms with $A_{ij}=1$ in $\mathcal L$ or alternatively to sub-sample terms with $A_{ij}=0$. We choose the former for the following experiments. We perform full-batch gradient descent and make use of the reparameterization trick for training. For a featureless approach, we simply drop the dependence on $X$ and replace $X$ with the identity matrix in the GCN.
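
As a hedged sketch of how Eq. (4) is usually turned into a training objective, combining the re-weighting of the $A_{ij}=1$ terms described above with the closed-form Gaussian KL derived below (the `pos_weight` heuristic and all variable names are assumptions, not taken from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vgae_negative_elbo(A, mu, logvar, Z):
    """Negative of Eq. (4): re-weighted reconstruction term plus Gaussian KL."""
    N = A.shape[0]
    probs = sigmoid(Z @ Z.T)                  # p(A_ij = 1 | z_i, z_j)
    pos_weight = (N * N - A.sum()) / A.sum()  # up-weight the rare A_ij = 1 entries
    weights = np.where(A == 1, pos_weight, 1.0)
    eps = 1e-10
    recon = -np.mean(weights * (A * np.log(probs + eps)
                                + (1 - A) * np.log(1 - probs + eps)))
    # KL[q(Z|X,A) || N(0, I)] in closed form, averaged over the N nodes.
    kl = -0.5 / N * np.sum(1 + logvar - mu**2 - np.exp(logvar))
    return recon + kl
```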
The second term on the right-hand side of Eq. (4):
$$\begin{aligned} \int q_{\theta}(z) \log p(z)\,dz &= \int \mathcal N(z;\mu,\sigma^2) \log \mathcal N(z;0,1)\,dz\\ &=-\frac{J}{2} \log (2\pi)-\frac{1}{2}\sum_{j=1}^{J}(\mu_j^2+\sigma_j^2) \end{aligned}$$
And
$$\begin{aligned} \int q_{\theta}(z) \log q_\theta(z)\,dz &= \int \mathcal N(z;\mu,\sigma^2) \log \mathcal N(z;\mu,\sigma^2)\,dz\\ &=-\frac{J}{2} \log (2\pi)-\frac{1}{2}\sum_{j=1}^{J}(1+\log \sigma_j^2) \end{aligned}$$
Therefore:
$$\begin{aligned} -D_{KL}(q_\phi(z)\,||\,p_\theta(z))&=\int q_\theta(z)\left(\log p_{\theta}(z)-\log q_\theta(z)\right)dz\\ &=\frac{1}{2}\sum_{j=1}^J\left(1+\log(\sigma_j^2)-\mu_j^2-\sigma_j^2\right) \end{aligned}$$
Here, the variational lower bound (the objective to be maximized) contains a KL term that can often be integrated analytically. Here we give the solution when both the prior $p_\theta(z) = \mathcal N(0, I)$ and the posterior approximation $q_\phi(z|x^{(i)})$ are Gaussian. Let $J$ be the dimensionality of $z$. Let $\mu$ and $\sigma$ denote the variational mean and s.d. evaluated at datapoint $i$, and let $\mu_j$ and $\sigma_j$ simply denote the $j$-th element of these vectors.
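
A small NumPy check that this closed-form expression matches a Monte Carlo estimate of the KL between $q=\mathcal N(\mu,\mathrm{diag}(\sigma^2))$ and $p=\mathcal N(0,I)$ (the particular $\mu$ and $\sigma$ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.2])
sigma = np.array([0.8, 1.5, 0.3])

# Closed form: KL = -0.5 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2)
kl_closed = -0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)

# Monte Carlo estimate of E_q[log q(z) - log p(z)]
z = mu + sigma * rng.normal(size=(200_000, 3))
log_q = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2), axis=1)
log_p = np.sum(-0.5 * np.log(2 * np.pi) - z**2 / 2, axis=1)
kl_mc = np.mean(log_q - log_p)

print(kl_closed, kl_mc)  # the two estimates should agree to about two decimal places
```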
The first term on the right-hand side of Eq. (4):
$$\begin{aligned} \mathbb E_{q(Z|X,A)}[\log p(A|Z)] &= \int q(Z|X,A) \log p(A|Z)\,dZ \\ &= \int \prod_{i=1}^N q(z_i|X,A) \log p(A|Z)\,dZ \\ &=\int \prod_{i=1}^N q(z_i|X,A) \sum_{i=1}^N \sum_{j=1}^N \log p(A_{ij}|z_i,z_j)\,dZ\\ &=\int \prod_{i=1}^N \mathcal N(z_i\mid \mu_i, \mathrm{diag}(\sigma_i^2)) \sum_{i=1}^N \sum_{j=1}^N \log \sigma(z_i^\top z_j)\,dZ\\ &=\int \prod_{i=1}^N \mathcal N(\eta_i\mid 0,1)\,f(\eta_i;W)\,d\eta_i\\ &\approx\frac{1}{S} \sum_{s=1}^{S} f(\eta^{(s)};W) \end{aligned}$$

where
$$\eta_i=(z_i-\mu_i(W))/\sigma_i(W)\sim \mathcal N(\eta\mid 0,1)$$
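
A minimal NumPy sketch of this last Monte Carlo step: draw $\eta \sim \mathcal N(0,1)$, form $z_i = \mu_i + \sigma_i \eta_i$, and average $\log p(A\mid Z)$ over $S$ samples (array shapes and the sample count are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def expected_log_likelihood(A, mu, sigma, S=16, seed=0):
    """Monte Carlo estimate of E_{q(Z|X,A)}[log p(A|Z)] with S reparameterized samples."""
    rng = np.random.default_rng(seed)
    eps = 1e-10
    total = 0.0
    for _ in range(S):
        eta = rng.normal(size=mu.shape)   # eta ~ N(0, 1)
        Z = mu + sigma * eta              # z_i = mu_i + sigma_i * eta_i
        probs = sigmoid(Z @ Z.T)          # p(A_ij = 1 | z_i, z_j)
        total += np.sum(A * np.log(probs + eps) + (1 - A) * np.log(1 - probs + eps))
    return total / S
```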
$$\begin{aligned} \mathbb{E}_{q(\mathbf{Z}\mid \mathbf{X},\mathbf{A})}\left[\log p(\mathbf{y},\mathbf{A}\mid \mathbf{Z})\right] &=\mathbb{E}_{q(\mathbf{Z}\mid \mathbf{X},\mathbf{A})}\left[\log [p(\mathbf{y}\mid \mathbf{Z})p(\mathbf{A} \mid \mathbf{Z})]\right]\\ &=\mathbb E_{q(\mathbf Z\mid\mathbf X,\mathbf A)} [\log p(\mathbf y\mid\mathbf Z)]+\mathbb E_{q(\mathbf Z\mid\mathbf X,\mathbf A)} [\log p(\mathbf A \mid \mathbf Z)]\\ \mathbb E_{q(Z|X,A)}[\log p(A|Z)] &\approx \frac{1}{L} \sum_{l=1}^{L}\log p_\theta(x^{(i)}\mid z^{(i,l)}) \end{aligned}$$
Proof of the reparameterization trick
We invoke an alternative method for generating samples from $q_\phi(z|x)$; the essential parameterization is quite simple. Let $z$ be a continuous random variable, and let $z \sim q_\phi(z|x)$ be some conditional distribution. It is often possible to express the random variable $z$ as a deterministic variable $z=g_\phi(\epsilon,x)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$.
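
A minimal numeric illustration of this trick for a Gaussian $q_\phi(z|x)=\mathcal N(\mu,\sigma^2)$, where $g_\phi(\epsilon,x)=\mu+\sigma\epsilon$ and $p(\epsilon)=\mathcal N(0,1)$ (the values of $\mu$ and $\sigma$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.4              # parameters of q_phi(z|x) for one datapoint
eps = rng.normal(size=100_000)    # eps ~ p(eps) = N(0, 1), independent of phi
z = mu + sigma * eps              # z = g_phi(eps, x); gradients flow through mu and sigma
print(z.mean(), z.std())          # close to 1.5 and 0.4, i.e. samples from N(mu, sigma^2)
```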
