Deriving the DDPM Cross-Entropy Loss

KL divergence

The derivation below relies on the KL divergence, so here is a brief introduction.
The KL divergence is commonly used to measure the "distance" between two probability distributions (it is not symmetric, so it is not a true metric). It is defined as:
$$\mathrm{KL}\big[P(X)\,\|\,Q(X)\big]=\sum_{x\in X}\Big[P(x)\log\frac{P(x)}{Q(x)}\Big]=\mathbb{E}_{x\sim P(x)}\Big[\log\frac{P(x)}{Q(x)}\Big]$$
Here $P(X)$ and $Q(X)$ are two probability distributions. For a discrete random variable the KL divergence sums over $x$; for a continuous random variable it integrates over $x$. The expectation form covers both cases.
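To make the definition concrete, here is a minimal sketch that evaluates the discrete sum for two small distributions. The function name `kl_divergence` and the probability values are illustrative assumptions, not something taken from the paper.

```python
import math

def kl_divergence(p, q):
    """KL[P || Q] for two discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute 0 by convention, so they are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical distributions over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print("KL[P||Q] =", kl_pq)
print("KL[Q||P] =", kl_qp)
```

The two printed values differ, which illustrates that the KL divergence is not symmetric, while `kl_divergence(p, p)` is zero.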
KL divergence between Gaussians
For two univariate Gaussians $p\sim\mathcal{N}(\mu_1,\sigma_1^2)$ and $q\sim\mathcal{N}(\mu_2,\sigma_2^2)$, the KL divergence is
$$\mathrm{KL}(p,q)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2}$$
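As a sanity check on the closed form, the sketch below compares it with a direct numerical integration of $p(x)\log\frac{p(x)}{q(x)}$. The helper names, the parameter values, and the simple Riemann-sum integrator are assumptions chosen for illustration.

```python
import math

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(p, q) for p = N(mu1, sigma1^2) and q = N(mu2, sigma2^2)."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

def log_normal_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def numeric_kl(mu1, sigma1, mu2, sigma2, n=20001, width=12.0):
    """Riemann sum of p(x) * (log p(x) - log q(x)) over mu1 +/- width * sigma1."""
    lo = mu1 - width * sigma1
    hi = mu1 + width * sigma1
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        lp = log_normal_pdf(x, mu1, sigma1)
        lq = log_normal_pdf(x, mu2, sigma2)
        total += math.exp(lp) * (lp - lq) * dx
    return total

# Hypothetical parameters: p = N(0, 1), q = N(1, 4).
closed_form = gaussian_kl(0.0, 1.0, 1.0, 2.0)
numeric = numeric_kl(0.0, 1.0, 1.0, 2.0)
print(closed_form, numeric)
```

The two numbers should agree to several decimal places; any sizeable gap would point to a mistake in the formula or in the integration range.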

Likelihood function

Below is the reverse process given in the paper, which defines the distribution of $\mathbf{x}_{t-1}$; its variance is treated as a constant.
$$p_{\theta}(\mathbf{x}_{0:T})=p(\mathbf{x}_T)\prod_{t=1}^T p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t),\qquad p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)=\mathcal{N}\big(\mathbf{x}_{t-1};\mu_{\theta}(\mathbf{x}_t,t),\Sigma_{\theta}(\mathbf{x}_t,t)\big)$$
To optimize the model we first need the likelihood that the diffusion model assigns to the target data. $p_{\theta}(\mathbf{x}_0)$ is the model's distribution over the target data; raising a lower bound on its log-likelihood pushes the log-likelihood up. For convenience we work with the negative log-likelihood $-\log p_{\theta}(\mathbf{x}_0)$ instead: the smaller its upper bound, the smaller the negative log-likelihood, and hence the larger the log-likelihood.
$$
\begin{aligned}
-\log p_{\theta}(\mathbf{x}_0) & \leq -\log p_{\theta}(\mathbf{x}_0)+D_{\rm KL}\big(q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)\parallel p_{\theta}(\mathbf{x}_{1:T}\mid\mathbf{x}_0)\big) \qquad(1)\\
& = -\log p_{\theta}(\mathbf{x}_0)+\mathbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})/p_{\theta}(\mathbf{x}_0)}\Big] \qquad(2)\\
& = -\log p_{\theta}(\mathbf{x}_0)+\mathbb{E}_q\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})}+\log p_{\theta}(\mathbf{x}_0)\Big] \qquad(3)\\
& = \mathbb{E}_{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})}\Big] \qquad(4)
\end{aligned}
$$

Derivation notes
  • $(1)$: A KL divergence is added on the right-hand side. Since the KL divergence is always greater than or equal to 0, the inequality holds. In other words, the right-hand side is an upper bound on the left-hand side, so minimizing the right-hand side pushes down the negative log-likelihood on the left.
  • $(1)\rightarrow(2)$: Expand the KL divergence using the definition above, with $P(x)$ corresponding to $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$ and $Q(x)$ corresponding to $p_{\theta}(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$. Expand $Q(x)$ with the definition of conditional probability: $p_{\theta}(\mathbf{x}_{1:T}\mid\mathbf{x}_0)=p_{\theta}(\mathbf{x}_{1:T},\mathbf{x}_0)/p_{\theta}(\mathbf{x}_0)=p_{\theta}(\mathbf{x}_{0:T})/p_{\theta}(\mathbf{x}_0)$, which gives the expression in step $(2)$.
  • $(2)\rightarrow(3)$: Expand the $\log$.
  • $(3)\rightarrow(4)$: The expectation is taken over $q$, so $\log p_{\theta}(\mathbf{x}_0)$ is a constant with respect to $q$. Hence $\mathbb{E}_q\big[\log p_{\theta}(\mathbf{x}_0)\big]=\log p_{\theta}(\mathbf{x}_0)$, which cancels with the $-\log p_{\theta}(\mathbf{x}_0)$ in front and leaves expression $(4)$.
End of derivation
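The bound in steps $(1)$ to $(4)$ can be checked on a toy discrete model. In this sketch a single latent variable with three values stands in for $\mathbf{x}_{1:T}$, and the joint probabilities are made-up numbers; it is an illustration of the inequality, not of the diffusion model itself.

```python
import math

# Hypothetical joint p(x0, z) for one fixed observation x0 and three latent values z.
# The latent z plays the role of x_{1:T} in the derivation.
p_joint = [0.10, 0.05, 0.15]
p_x0 = sum(p_joint)               # marginal p(x0)
neg_log_lik = -math.log(p_x0)     # left-hand side: -log p(x0)

def upper_bound(q):
    """E_q[log q(z | x0) / p(x0, z)], the right-hand side of step (4)."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p_joint) if qz > 0)

q_arbitrary = [0.6, 0.3, 0.1]                  # any valid distribution over z
q_posterior = [pz / p_x0 for pz in p_joint]    # the exact posterior p(z | x0)

bound_arbitrary = upper_bound(q_arbitrary)
bound_posterior = upper_bound(q_posterior)
print(neg_log_lik, bound_arbitrary, bound_posterior)
```

For an arbitrary $q$ the bound is strictly above $-\log p(\mathbf{x}_0)$; when $q$ equals the exact posterior the added KL term vanishes and the bound is tight.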

Next we take the expectation of the left-hand side $-\log p_{\theta}(\mathbf{x}_0)$ over $q(\mathbf{x}_0)$, which gives $-\mathbb{E}_{q(\mathbf{x}_0)}\log p_{\theta}(\mathbf{x}_0)$ (the cross-entropy, i.e. the loss). Correspondingly, the right-hand side must also be averaged over $\mathbf{x}_0$, so $\mathbb{E}_{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}$ becomes $\mathbb{E}_{q(\mathbf{x}_{0:T})}$. To minimize the loss, we therefore minimize this upper bound, which we denote $L_{\rm VLB}$:
$$\text{Let }L_{\rm VLB}=\mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})}\Big]\geq -\mathbb{E}_{q(\mathbf{x}_0)}\log p_{\theta}(\mathbf{x}_0)$$

Simplifying the upper bound on the loss

$$
\begin{aligned}
L_{\rm VLB} & = \mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[\log\frac{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{0:T})}\Big] \qquad(1)\\
& = \mathbb{E}_q\Big[\log\frac{\prod_{t=1}^T q(\mathbf{x}_t\mid\mathbf{x}_{t-1})}{p_{\theta}(\mathbf{x}_T)\prod_{t=1}^T p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}\Big] \qquad(2)\\
& = \mathbb{E}_q\Big[-\log p_{\theta}(\mathbf{x}_T)+\sum_{t=1}^T\log\frac{q(\mathbf{x}_t\mid\mathbf{x}_{t-1})}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}\Big] \qquad(3)\\
& = \mathbb{E}_q\Big[-\log p_{\theta}(\mathbf{x}_T)+\sum_{t=2}^T\log\frac{q(\mathbf{x}_t\mid\mathbf{x}_{t-1})}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}+\log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}\Big] \qquad(4)\\
& = \mathbb{E}_q\Big[-\log p_{\theta}(\mathbf{x}_T)+\sum_{t=2}^T\log\Big(\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}\cdot\frac{q(\mathbf{x}_t\mid\mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}\Big)+\log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}\Big] \qquad(5)\\
& = \mathbb{E}_q\Big[-\log p_{\theta}(\mathbf{x}_T)+\sum_{t=2}^T\log\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}+\sum_{t=2}^T\log\frac{q(\mathbf{x}_t\mid\mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}+\log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}\Big] \qquad(6)\\
& = \mathbb{E}_q\Big[-\log p_{\theta}(\mathbf{x}_T)+\sum_{t=2}^T\log\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}+\log\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{q(\mathbf{x}_1\mid\mathbf{x}_0)}+\log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}\Big] \qquad(7)\\
& = \mathbb{E}_q\Big[\log\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_T)}+\sum_{t=2}^T\log\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}-\log p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)\Big] \qquad(8)\\
& = \mathbb{E}_q\Big[\underbrace{D_{\rm KL}\big(q(\mathbf{x}_T\mid\mathbf{x}_0)\parallel p_{\theta}(\mathbf{x}_T)\big)}_{L_T}+\sum_{t=2}^T\underbrace{D_{\rm KL}\big(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\parallel p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\big)}_{L_{t-1}}\underbrace{-\log p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}_{L_0}\Big] \qquad(9)
\end{aligned}
$$

Derivation notes
  • $(1)\rightarrow(2)$: Factorize both distributions. $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)$ is the diffusion (forward) process, which moves step by step from $\mathbf{x}_0$ to $\mathbf{x}_T$ and satisfies the Markov assumption, so $q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)=q(\mathbf{x}_1\mid\mathbf{x}_0)\cdot q(\mathbf{x}_2\mid\mathbf{x}_1)\cdots q(\mathbf{x}_T\mid\mathbf{x}_{T-1})=\prod_{t=1}^T q(\mathbf{x}_t\mid\mathbf{x}_{t-1})$. For $p_{\theta}(\mathbf{x}_{0:T})$, first rewrite it as $p_{\theta}(\mathbf{x}_T)\,p_{\theta}(\mathbf{x}_{0:T-1}\mid\mathbf{x}_T)$ using conditional probability, then factorize the second factor in the same way as $q$.
  • $(2)\rightarrow(3)$: Expand the $\log$; the products turn into sums.
  • $(3)\rightarrow(4)$: Pull the $t=1$ term $\log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}$ out of the sum and handle it separately.
  • $(4)\rightarrow(5)$: Recall that in the earlier discussion of the reverse diffusion process we obtained $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)=q(\mathbf{x}_t\mid\mathbf{x}_{t-1})\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}{q(\mathbf{x}_t\mid\mathbf{x}_0)}$. Solving this for $q(\mathbf{x}_t\mid\mathbf{x}_{t-1})$ and substituting gives $(5)$.
  • $(5)\rightarrow(6)$: Expand the $\log$.
  • $(6)\rightarrow(7)$: The sum telescopes: $\sum_{t=2}^T\log\frac{q(\mathbf{x}_t\mid\mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}=\log\Big(\frac{q(\mathbf{x}_2\mid\mathbf{x}_0)}{q(\mathbf{x}_1\mid\mathbf{x}_0)}\cdot\frac{q(\mathbf{x}_3\mid\mathbf{x}_0)}{q(\mathbf{x}_2\mid\mathbf{x}_0)}\cdots\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{q(\mathbf{x}_{T-1}\mid\mathbf{x}_0)}\Big)=\log\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{q(\mathbf{x}_1\mid\mathbf{x}_0)}$
  • $(7)\rightarrow(8)$: $\log\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{q(\mathbf{x}_1\mid\mathbf{x}_0)}+\log\frac{q(\mathbf{x}_1\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}=\log q(\mathbf{x}_T\mid\mathbf{x}_0)-\log p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)$; then combine $\log q(\mathbf{x}_T\mid\mathbf{x}_0)$ with $-\log p_{\theta}(\mathbf{x}_T)$ into $\log\frac{q(\mathbf{x}_T\mid\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_T)}$.
  • $(8)\rightarrow(9)$: For $L_T$, neither $q(\mathbf{x}_T\mid\mathbf{x}_0)$ nor $p_{\theta}(\mathbf{x}_T)$ has trainable parameters: the former is determined by the $\beta_t$ schedule, and the latter is an isotropic Gaussian. So $L_T$ contains no parameters and can be dropped during optimization. For $L_{t-1}$, by the definition of the KL divergence, the expected log-ratio can be written as a KL divergence. If we formally set $t=1$ in this term, it becomes $\log\frac{q(\mathbf{x}_0\mid\mathbf{x}_1,\mathbf{x}_0)}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}=\log\frac{1}{p_{\theta}(\mathbf{x}_0\mid\mathbf{x}_1)}$ (treating $q(\mathbf{x}_0\mid\mathbf{x}_1,\mathbf{x}_0)$ as 1), which is exactly the $L_0$ term that follows $L_{t-1}$, so the two can be merged. We therefore only need to optimize $L_{t-1}$.
End of derivation
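The Bayes identity used in step $(4)\rightarrow(5)$ can be checked numerically for scalar Gaussians. This sketch assumes the standard DDPM forward process, which this section references but does not restate: $q(\mathbf{x}_t\mid\mathbf{x}_{t-1})=\mathcal{N}(\sqrt{\alpha_t}\mathbf{x}_{t-1},\beta_t)$, $q(\mathbf{x}_t\mid\mathbf{x}_0)=\mathcal{N}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0,1-\bar{\alpha}_t)$, and the usual closed-form posterior mean and variance. The toy $\beta$ schedule and the test points are arbitrary choices.

```python
import math

def log_normal(x, mean, var):
    """Log-density of a scalar Gaussian N(mean, var) at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

# Toy schedule (an assumption for illustration): beta_1..beta_T rising linearly.
T = 10
betas = [0.05 + (0.3 - 0.05) * i / (T - 1) for i in range(T)]
alphas = [1 - b for b in betas]
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

def bayes_gap(t, x0, x_prev, x_t):
    """log q(x_{t-1}|x_t,x0) minus the Bayes right-hand side, for 2 <= t <= T."""
    b, a = betas[t - 1], alphas[t - 1]
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    # Standard DDPM posterior mean and variance (assumed, see lead-in).
    post_mean = (math.sqrt(ab_prev) * b * x0
                 + math.sqrt(a) * (1 - ab_prev) * x_t) / (1 - ab_t)
    post_var = (1 - ab_prev) / (1 - ab_t) * b
    lhs = log_normal(x_prev, post_mean, post_var)
    rhs = (log_normal(x_t, math.sqrt(a) * x_prev, b)             # q(x_t | x_{t-1})
           + log_normal(x_prev, math.sqrt(ab_prev) * x0, 1 - ab_prev)  # q(x_{t-1} | x0)
           - log_normal(x_t, math.sqrt(ab_t) * x0, 1 - ab_t))    # q(x_t | x0)
    return lhs - rhs

gaps = [bayes_gap(t, x0=0.7, x_prev=0.3, x_t=-0.2) for t in range(2, T + 1)]
max_gap = max(abs(g) for g in gaps)
print("largest deviation:", max_gap)
```

The deviation should be at floating-point noise level for every $t\geq 2$, which is the range of the sum in which the substitution is applied.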

In the paper, the authors treat the variance of $p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ as a constant tied to $\beta$, so the trainable parameters live only in its mean. In $L_{t-1}$, $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$ is a Gaussian whose mean and variance were already derived in the earlier reverse-process discussion: its mean is $\tilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)$ and its variance is a constant that depends on $\beta_t$. $p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ is also assumed to be Gaussian, with constant variance and mean $\mu_{\theta}(\mathbf{x}_t,t)$, so the parameters appear only in $\mu_{\theta}$. Applying the Gaussian KL formula to these two distributions, every term that involves only the variances is independent of $\theta$ and can be absorbed into a constant $C$. This gives:
$$L_{t-1}=\mathbb{E}_q\Big[\frac{1}{2\sigma_t^2}\lVert\tilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)-\mu_{\theta}(\mathbf{x}_t,t)\rVert^2\Big]+C$$
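To see where the constant $C$ comes from, the Gaussian KL formula from the beginning of the post can be applied to a single coordinate. The symbol $\tilde{\beta}_t$ for the variance of $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$ is an assumption borrowed from the paper's notation; it is not defined in this post.

```latex
\begin{aligned}
D_{\rm KL}\big(\mathcal{N}(\tilde{\mu}_t,\tilde{\beta}_t)\parallel\mathcal{N}(\mu_{\theta},\sigma_t^2)\big)
 &= \log\frac{\sigma_t}{\sqrt{\tilde{\beta}_t}}+\frac{\tilde{\beta}_t+(\tilde{\mu}_t-\mu_{\theta})^2}{2\sigma_t^2}-\frac{1}{2}\\
 &= \frac{(\tilde{\mu}_t-\mu_{\theta})^2}{2\sigma_t^2}
   +\underbrace{\log\frac{\sigma_t}{\sqrt{\tilde{\beta}_t}}+\frac{\tilde{\beta}_t}{2\sigma_t^2}-\frac{1}{2}}_{\text{independent of }\theta}
\end{aligned}
```

Summing over coordinates turns the squared difference into the squared norm, and the $\theta$-independent terms add up to $C$. In the special case $\sigma_t^2=\tilde{\beta}_t$ these terms cancel and $C=0$.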

This makes the optimization target clear: we want $\mu_{\theta}$ to approach $\tilde{\mu}_t$ as closely as possible, since that minimizes $L_{t-1}$. First substitute the expression for $\tilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)$ into the formula above, writing the noise (denoted $\tilde{z}_t$ in the original expression) as $\epsilon$ and replacing $\mathbf{x}_t$ with $\mathbf{x}_t(\mathbf{x}_0,\epsilon)$; this yields the expression after the second equals sign below.
$$
\begin{aligned}
L_{t-1}-C & = \mathbb{E}_{\mathbf{x}_0,\epsilon}\Bigg[\frac{1}{2\sigma_t^2}\Big\lVert\tilde{\mu}_t\Big(\mathbf{x}_t(\mathbf{x}_0,\epsilon),\frac{1}{\sqrt{\bar{\alpha}_t}}\big(\mathbf{x}_t(\mathbf{x}_0,\epsilon)-\sqrt{1-\bar{\alpha}_t}\,\epsilon\big)\Big)-\mu_{\theta}(\mathbf{x}_t(\mathbf{x}_0,\epsilon),t)\Big\rVert^2\Bigg]\\
& = \mathbb{E}_{\mathbf{x}_0,\epsilon}\Bigg[\frac{1}{2\sigma_t^2}\Big\lVert\frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t(\mathbf{x}_0,\epsilon)-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon\Big)-\mu_{\theta}(\mathbf{x}_t(\mathbf{x}_0,\epsilon),t)\Big\rVert^2\Bigg]
\end{aligned}
$$
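The substitution between the two lines above can be verified numerically. The sketch assumes the standard DDPM posterior mean $\tilde{\mu}_t(\mathbf{x}_t,\mathbf{x}_0)=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0+\frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t$, which this post refers to but does not write out; the schedule and the function names are illustrative.

```python
import math
import random

# Toy schedule (an assumption for illustration): beta_1..beta_T rising linearly.
T = 10
betas = [0.05 + (0.3 - 0.05) * i / (T - 1) for i in range(T)]
alphas = [1 - b for b in betas]
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

def posterior_mean(t, x_t, x0):
    """Standard DDPM posterior mean mu_tilde_t(x_t, x0) (assumed form, see lead-in)."""
    b, a = betas[t - 1], alphas[t - 1]
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    return (math.sqrt(ab_prev) * b * x0
            + math.sqrt(a) * (1 - ab_prev) * x_t) / (1 - ab_t)

def mean_from_noise(t, x_t, eps):
    """The simplified form: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)."""
    b, a, ab_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    return (x_t - b / math.sqrt(1 - ab_t) * eps) / math.sqrt(a)

rng = random.Random(0)
max_diff = 0.0
for t in range(2, T + 1):
    x0 = rng.gauss(0.0, 1.0)
    eps = rng.gauss(0.0, 1.0)
    ab_t = alpha_bars[t - 1]
    # Forward sample x_t(x0, eps), then recover x0 from x_t and eps.
    x_t = math.sqrt(ab_t) * x0 + math.sqrt(1 - ab_t) * eps
    x0_rec = (x_t - math.sqrt(1 - ab_t) * eps) / math.sqrt(ab_t)
    diff = abs(posterior_mean(t, x_t, x0_rec) - mean_from_noise(t, x_t, eps))
    max_diff = max(max_diff, diff)
print("largest difference:", max_diff)
```

Both forms of the mean agree up to rounding error, which is what allows the target to be expressed in terms of $\epsilon$ alone.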
Here $\mathbf{x}_t$ is known, since it is the input to the model. To minimize $L_{t-1}$ we can therefore give $\mu_{\theta}(\mathbf{x}_t,t)$ the same form as $\tilde{\mu}_t$; the only unknown in that form is $\epsilon$, so we train a network $\epsilon_{\theta}$ to predict it:
$$\mu_{\theta}(\mathbf{x}_t,t)=\tilde{\mu}_t\Big(\mathbf{x}_t,\frac{1}{\sqrt{\bar{\alpha}_t}}\big(\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\,\epsilon_{\theta}(\mathbf{x}_t,t)\big)\Big)=\frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t-\frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_{\theta}(\mathbf{x}_t,t)\Big)$$
$L_{t-1}$ then simplifies, up to the constant $C$, to
$$L_{t-1}-C=\mathbb{E}_{\mathbf{x}_0,\epsilon}\Big[\frac{\beta_t^2}{2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)}\big\lVert\epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,t)\big\rVert^2\Big]$$
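The coefficient in this expression can be checked for a single timestep. In this sketch the values of $\alpha_t$ and $\bar{\alpha}_t$, the choice $\sigma_t^2=\beta_t$, and the stand-in prediction `eps_hat` are all assumptions picked for illustration.

```python
import math

# Hypothetical single-timestep values.
alpha_t = 0.9
alpha_bar_t = 0.5
beta_t = 1 - alpha_t
sigma2_t = beta_t  # assumed fixed variance of p_theta; other constant choices work too

def mean_from_noise(x_t, eps):
    """(x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)."""
    return (x_t - beta_t / math.sqrt(1 - alpha_bar_t) * eps) / math.sqrt(alpha_t)

x0, eps = 0.8, -0.4
eps_hat = 0.1  # stand-in for the network output eps_theta(x_t, t)
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1 - alpha_bar_t) * eps

# Loss written with the two means: ||mu_tilde - mu_theta||^2 / (2 sigma_t^2).
mean_form = (mean_from_noise(x_t, eps) - mean_from_noise(x_t, eps_hat)) ** 2 / (2 * sigma2_t)

# Loss written with the noise: weight * ||eps - eps_theta||^2.
weight = beta_t ** 2 / (2 * sigma2_t * alpha_t * (1 - alpha_bar_t))
noise_form = weight * (eps - eps_hat) ** 2

print(mean_form, noise_form)
```

The $\mathbf{x}_t$ terms cancel in the difference of the two means, leaving only the scaled noise difference, so both forms give the same number.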
The authors further found that dropping the coefficient makes training more stable and improves sample quality, which gives $L_{\rm simple}$ below:
$$L_{\rm simple}(\theta):=\mathbb{E}_{t,\mathbf{x}_0,\epsilon}\Big[\big\lVert\epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon,t)\big\rVert^2\Big]$$
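A Monte Carlo estimate of $L_{\rm simple}$ can be sketched in a few lines. Everything here is a simplified assumption: `eps_model` is a hypothetical placeholder that always predicts zero noise (a real model would be a trained network), the linear $\beta$ schedule from $10^{-4}$ to $0.02$ with $T=1000$ follows the setting reported in the DDPM paper, and the data point is made up.

```python
import math
import random

# Linear beta schedule (assumed; not stated in this post).
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= 1 - b
    alpha_bars.append(prod)

def eps_model(x_t, t):
    """Hypothetical placeholder for eps_theta(x_t, t): always predicts zero noise."""
    return [0.0 for _ in x_t]

def l_simple_sample(x0, rng):
    """One sample of ||eps - eps_theta(sqrt(ab) x0 + sqrt(1 - ab) eps, t)||^2."""
    t = rng.randint(1, T)                       # t ~ Uniform{1, ..., T}
    eps = [rng.gauss(0.0, 1.0) for _ in x0]     # eps ~ N(0, I)
    ab = alpha_bars[t - 1]
    x_t = [math.sqrt(ab) * x + math.sqrt(1 - ab) * e for x, e in zip(x0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 0.0]   # a made-up 4-dimensional data point
num_samples = 2000
estimate = sum(l_simple_sample(x0, rng) for _ in range(num_samples)) / num_samples
print("estimated L_simple:", estimate)
```

With the zero predictor the estimate stays close to the data dimension, since $\mathbb{E}\lVert\epsilon\rVert^2$ equals the number of coordinates. A trained $\epsilon_{\theta}$ would be expected to drive this value lower.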

