This article is a reading note. Its main content covers the diffusion model and the Stable Diffusion model, and introduces the KL loss, the VAE loss function, and the variational lower bound as extensions. A diffusion model is a generative model that defines a Markov chain of gradual diffusion, progressively adding noise to the data, and then learns the reverse diffusion process so that the desired data samples can be constructed from noise. Stable Diffusion adds an encoder on top of this to reduce the dimensionality of the training data and lower the training cost; it also adds an extra text embedding vector, through which the model can generate images from text.
Main reference: deephub, "Diffusion 和Stable Diffusion的数学和工作原理详细解释", https://zhuanlan.zhihu.com/p/597924053
Training
$$q(x_t \mid x_{t-1}) = N\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
Gaussian noise is added to the image step by step, iteratively producing the samples $x_1, \dots, x_T$.
As $T \rightarrow \infty$, the result approaches an image of pure Gaussian noise.
Sampling method: a closed-form formula can be used to sample the noised image directly at any given timestep.
It can be derived with the reparameterization trick:
If $z$ is a sample from the Gaussian distribution $N(\mu, \sigma^2)$, then $z = \mu + \sigma\epsilon$ with $\epsilon \sim N(0, I)$.
Here $\mu = \sqrt{1-\beta_t}\,x_{t-1}$ and $\sigma^2 = \beta_t I$.
Let $\alpha_t = 1-\beta_t$, $\overline{\alpha}_t = \prod_{i=1}^{t}\alpha_i$, $\epsilon_i \sim N(0, I)$, and $\overline{\epsilon}_i \sim N(0, I)$.
$$\begin{aligned}x_t &= \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_{t-1}\\&= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon_{t-2}\right) + \sqrt{1-\alpha_t}\,\epsilon_{t-1}\\&= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{(1-\alpha_{t-1})\alpha_t}\,\epsilon_{t-2} + \sqrt{1-\alpha_t}\,\epsilon_{t-1}\end{aligned}$$
Let $X = \sqrt{(1-\alpha_{t-1})\alpha_t}\,\epsilon_{t-2}$ and $Y = \sqrt{1-\alpha_t}\,\epsilon_{t-1}$.
Since $\epsilon_{t-2}, \epsilon_{t-1} \sim N(0, I)$, it follows that $X \sim N\!\left(0,\ \alpha_t(1-\alpha_{t-1})I\right)$ and $Y \sim N\!\left(0,\ (1-\alpha_t)I\right)$.
Let $Z = X + Y$. Because $X$ and $Y$ are independent and $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\alpha_{t-1}$, we have $Z \sim N\!\left(0,\ (1-\alpha_t\alpha_{t-1})I\right)$.
Thus $X + Y = \sqrt{1-\alpha_t\alpha_{t-1}}\,\overline{\epsilon}_{t-2}$, where $\overline{\epsilon}_{t-2} \sim N(0, I)$.
$$\begin{aligned}x_t &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{(1-\alpha_{t-1})\alpha_t}\,\epsilon_{t-2} + \sqrt{1-\alpha_t}\,\epsilon_{t-1}\\&= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + X + Y\\&= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\overline{\epsilon}_{t-2}\\&= \sqrt{\alpha_t\alpha_{t-1}\dots\alpha_1}\,x_0 + \sqrt{1-\alpha_t\alpha_{t-1}\dots\alpha_1}\,\epsilon\\&= \sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon\end{aligned}$$
Letting $\prod_{i=1}^{t}\alpha_i = \overline{\alpha}_t$, we obtain equation (1): $x_t = \sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon$.
Note: each $\epsilon$ is an independent standard normal random variable drawn from the same distribution.
Equation (1) can be used to sample $x_t$ directly at any timestep.
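Equation (1) maps directly to code. Below is a minimal PyTorch-style sketch (not taken from the source note); the linear beta schedule, tensor shapes, and names such as `q_sample` are illustrative assumptions.

```python
import torch

# Illustrative linear beta schedule (the concrete values are assumptions)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod_{i<=t} alpha_i

def q_sample(x0, t, eps=None):
    """Sample x_t directly from x_0 via equation (1):
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    if eps is None:
        eps = torch.randn_like(x0)            # eps ~ N(0, I)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)   # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

# Usage: noise a batch of images at random timesteps
x0 = torch.randn(8, 3, 32, 32)                # stand-in for real image data
t = torch.randint(0, T, (8,))
xt = q_sample(x0, t)
```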
The reverse process cannot be realized by simply inverting the noise; instead, a neural network is trained so that the approximated distribution $p_\theta(x_{t-1} \mid x_t)$ matches the target distribution $q(x_{t-1} \mid x_t, x_0)$, which is normally distributed.
Target distribution: $q(x_{t-1} \mid x_t, x_0) = N\!\left(x_{t-1};\ \overline{\mu}_t(x_t, x_0),\ \overline{\beta}_t I\right)$
Approximated distribution: $p_\theta(x_{t-1} \mid x_t) = N\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$
$$\begin{cases}\mu_\theta(x_t, t) := \overline{\mu}_t(x_t, x_0)\\\Sigma_\theta(x_t, t) := \beta_t I\end{cases}$$
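The source does not write out $\overline{\mu}_t$ and $\overline{\beta}_t$; as a supplement (the standard closed-form posterior of the Gaussian forward process, stated here with the notation above), they are

$$\overline{\mu}_t(x_t, x_0) = \frac{\sqrt{\overline{\alpha}_{t-1}}\,\beta_t}{1-\overline{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}\,x_t, \qquad \overline{\beta}_t = \frac{1-\overline{\alpha}_{t-1}}{1-\overline{\alpha}_t}\,\beta_t$$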
$$\mathrm{LOSS} = -\log p_\theta(x_0)$$
Here $p_\theta(x_0)$ depends on $x_1, x_2, \dots, x_T$ and is intractable, so a method similar to the one used for VAEs is adopted: the intractable loss is optimized indirectly by optimizing a lower bound. For the KL loss and the VAE loss function, see the loss function of the VAE model below.
$$\mathrm{LOSS} = \mathrm{MSE}(X, X') + \mathrm{KL}\!\left(N(\mu, \sigma^2),\ N(0, 1)\right)$$
The first term $\mathrm{MSE}(X, X')$ is the reconstruction loss, which reflects the difference between the generated output and the input.
The second term $\mathrm{KL}\!\left(N(\mu, \sigma^2),\ N(0, 1)\right)$ is the KL loss, a KL-divergence regularizer: in a VAE trained without the reconstruction loss, the latent space collapses to a standard normal distribution, hence its role is to push the latent variables produced by the encoder to follow the standard normal distribution as closely as possible.
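As a concrete illustration (a minimal sketch under the assumption that the encoder outputs `mu` and `logvar`, not the source's implementation), the loss can be computed with the closed-form KL between $N(\mu, \sigma^2)$ and $N(0, 1)$:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """VAE loss = MSE(X, X') + KL(N(mu, sigma^2) || N(0, 1)),
    where logvar = log(sigma^2) is produced by the encoder."""
    recon = F.mse_loss(x_recon, x, reduction="sum")   # reconstruction loss
    # Closed-form KL(N(mu, sigma^2) || N(0, 1)) = 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2)
    kl = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar)
    return recon + kl
```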
Since the KL divergence is non-negative, we have
$$-\log p_\theta(x_0) \le -\log p_\theta(x_0) + D_{KL}\!\left(q(x_{1:T} \mid x_0)\ \|\ p_\theta(x_{1:T} \mid x_0)\right)$$
$$-\log p_\theta(x_0) \le E_q\!\left[\log\frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right]$$
Tip: this is the variational lower bound; see the section on variational inference for details.
Expanding this gives
$$-\log p_\theta(x_0) \le E_q\!\left[D_{KL}\!\left(q(x_T \mid x_0)\ \|\ p_\theta(x_T)\right) + \sum_{t=2}^{T} D_{KL}\!\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1)\right]$$
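Each term in the bound is a KL divergence between Gaussians and has a closed form; under the common DDPM parameterization (a standard simplification, not derived in this note) the bound reduces to a noise-prediction objective, which is also the form used in the training description and the LDM loss below:

$$L_{simple} = E_{t, x_0, \epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon,\ t\right)\right\|^2\right]$$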
Explanation of the KL divergence
This part is mainly about variational inference.
Variational inference addresses the computation of posterior probabilities in Bayesian statistics. MCMC (Markov chain Monte Carlo) methods are slow on large datasets, whereas variational inference provides a faster and simpler approximate inference method.
Assume observations $\vec{x} = x_{1:n}$ and latent variables $\vec{z} = z_{1:m}$. The inference problem is to infer the posterior conditional distribution $p(\vec{z} \mid \vec{x})$. The basic idea of the variational approach is to turn this into an optimization problem. First, introduce a family $Q$ of approximate distributions over the latent variables and look for the member with the smallest KL divergence to the true posterior, i.e. $q^{*}(\vec{z}) = \operatorname*{arg\,min}_{q(\vec{z}) \in Q} \mathrm{KL}\!\left(q(\vec{z})\ \|\ p(\vec{z} \mid \vec{x})\right)$. Then $q^{*}(\vec{z})$ can be used as an approximate replacement for the true posterior. This turns variational inference into an optimization problem; by choosing a suitable $Q$, the family of densities can be made flexible enough while remaining simple enough to optimize.
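Because this KL contains the intractable evidence $\log p(\vec{x})$, it is optimized through the evidence lower bound (ELBO). The standard identity (added here for completeness) is

$$\mathrm{KL}\!\left(q(\vec{z})\,\|\,p(\vec{z} \mid \vec{x})\right) = \log p(\vec{x}) - \Big(E_q[\log p(\vec{z}, \vec{x})] - E_q[\log q(\vec{z})]\Big),$$

where the term in parentheses is $\mathrm{ELBO}(q)$; since $\log p(\vec{x})$ does not depend on $q$, minimizing the KL is equivalent to maximizing the ELBO.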
Variational inference in detail
Choosing the approximate probability distribution
Based on the ELBO and the mean-field assumption, coordinate ascent variational inference (CAVI) can be used.
By the chain rule of conditional probability, $p(z_{1:m}, x_{1:n}) = p(x_{1:n})\prod_{j=1}^{m} p(z_j \mid z_{1:(j-1)}, x_{1:n})$.
Under the mean-field factorization $q(z_{1:m}) = \prod_{j=1}^{m} q_j(z_j)$, the expectation of the log variational distribution is $E[\log q(z_{1:m})] = \sum_{j=1}^{m} E_j[\log q_j(z_j)]$.
Substituting into the ELBO gives
$$\begin{aligned}\mathrm{ELBO}(q) &= E[\log p(\vec{z})] + E[\log p(\vec{x} \mid \vec{z})] - E[\log q(\vec{z})]\\&= \log p(x_{1:n}) + \sum_{j=1}^{m} E\!\left[\log p(z_j \mid z_{1:(j-1)}, x_{1:n})\right] - \sum_{j=1}^{m} E_j\!\left[\log q_j(z_j)\right]\end{aligned}$$
Taking the derivative with respect to $q(z_k)$ and setting it to zero: $\frac{d\,\mathrm{ELBO}}{d\,q(z_k)} = E_{-k}\!\left[\log p(z_k \mid z_{-k}, x)\right] - \log q(z_k) - 1 = 0$
This yields the CAVI update: $q^{*}(z_k) \propto \exp\!\left\{E_{-k}\!\left[\log p(z_k \mid z_{-k}, x)\right]\right\}$, i.e. each factor is proportional to the exponentiated expected log of its complete conditional.
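To make the CAVI update concrete, here is a small self-contained sketch on a toy model that is not part of the source note: a 1-D Gaussian mixture with unit-variance components, prior $\mu_k \sim N(0, \sigma_0^2)$, and mean-field factors $q(\mu_k) = N(m_k, s_k^2)$, $q(c_i) = \mathrm{Cat}(\varphi_i)$. Each update below is the exponentiated expected log complete conditional from the formula above.

```python
import numpy as np

def cavi_gmm(x, K=2, sigma0=5.0, iters=50, seed=0):
    """Coordinate ascent variational inference for a toy 1-D Gaussian mixture
    with unit-variance components and prior mu_k ~ N(0, sigma0^2)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    m = rng.normal(size=K)            # variational means of q(mu_k)
    s2 = np.ones(K)                   # variational variances of q(mu_k)
    phi = np.full((n, K), 1.0 / K)    # responsibilities q(c_i = k)

    for _ in range(iters):
        # Update q(c_i): log phi_ik proportional to E[mu_k] x_i - E[mu_k^2] / 2
        logits = np.outer(x, m) - 0.5 * (s2 + m**2)
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)

        # Update q(mu_k): precision = 1/sigma0^2 + sum_i phi_ik,
        #                 mean      = (sum_i phi_ik x_i) * variance
        nk = phi.sum(axis=0)
        s2 = 1.0 / (1.0 / sigma0**2 + nk)
        m = s2 * (phi * x[:, None]).sum(axis=0)

    return m, s2, phi

# Usage on synthetic data drawn from two well-separated components
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
m, s2, phi = cavi_gmm(data, K=2)
print("estimated component means:", m)
```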
In each training epoch, a timestep $t$ is drawn at random for every sample; the timesteps are combined into embedding vectors, and the noise is generated according to the timestep, as in the sketch below.
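A minimal training-step sketch of this scheme (an illustration that reuses `T` and `q_sample` from the earlier snippet; `model` is a placeholder for a timestep-conditioned UNet that predicts the added noise):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, x0):
    """One DDPM-style training step: draw a random timestep and noise for each
    sample, noise x0 with equation (1), and regress the predicted noise onto eps."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random timestep per sample
    eps = torch.randn_like(x0)                                  # noise generated for that timestep
    xt = q_sample(x0, t, eps)                                   # closed-form forward sample
    eps_pred = model(xt, t)                                     # the UNet embeds t internally
    loss = F.mse_loss(eps_pred, eps)                            # simplified noise-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```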
(Figure: training steps)
(Figure: training process)
Reverse diffusion process
Diffusion sampling iteratively feeds the full-size image to the UNet to obtain the result, which is slow when the total number of diffusion steps $T$ and the image size are large; this is what motivates Stable Diffusion (see the sampling-loop sketch below).
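The cost comes from the sampling loop: each of the $T$ steps evaluates the UNet on a full-resolution image. A minimal DDPM-style ancestral sampling sketch (under the same assumed schedule and names as the snippets above, not the source's code):

```python
import torch

@torch.no_grad()
def p_sample_loop(model, shape):
    """Iteratively denoise from pure Gaussian noise x_T down to x_0.
    Every step runs the full-size UNet once, which is what makes
    pixel-space sampling slow for large T and large images."""
    x = torch.randn(shape)                                # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, t_batch)                      # predicted noise eps_theta(x_t, t)
        alpha, a_bar = alphas[t], alpha_bars[t]
        # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_t)
        mean = (x - (1.0 - alpha) / (1.0 - a_bar).sqrt() * eps_pred) / alpha.sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)   # add sigma_t * z
        else:
            x = mean
    return x
```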
An autoencoder is trained: the encoder $E$ learns to compress the image data into lower-dimensional latent data, and the decoder $D$ decodes the latent data back into images. The denoising process is then carried out in the latent space.
This is analogous to switching from working with a 3D scene to working with a photograph of that 3D scene.
The forward and reverse diffusion operations are the same as in the diffusion model.
After training, the model can encode text into a text embedding vector (an extra input); given this embedding together with a random latent image and the timestep embedding vector, it runs the diffusion process and decodes the result into an image related to the text, as sketched below.
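A rough sketch of the text-to-image flow described here (placeholder components: `text_encoder` for the text embedding, `unet` for the conditional denoiser, `decoder` for $D$; it reuses the schedule variables from the earlier snippets and is not Stable Diffusion's actual implementation):

```python
import torch

@torch.no_grad()
def generate(text_encoder, unet, decoder, caption, latent_shape):
    """Text-to-image generation: encode the text, denoise a random latent
    conditioned on it, then decode the final latent into an image."""
    cond = text_encoder([caption])                     # text embedding (the extra input)
    z = torch.randn(latent_shape)                      # random latent image z_T
    for t in reversed(range(T)):
        t_batch = torch.full((latent_shape[0],), t, dtype=torch.long)
        eps_pred = unet(z, t_batch, cond)              # conditional noise prediction
        alpha, a_bar = alphas[t], alpha_bars[t]
        z = (z - (1.0 - alpha) / (1.0 - a_bar).sqrt() * eps_pred) / alpha.sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return decoder(z)                                  # decode the latent into the image
```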
By augmenting its denoising UNet with a cross-attention mechanism, the internal diffusion model is turned into a conditional image generator; a switch on the conditioning input controls which function is active.
$$L_{LDM} = E_{t, z_0, \epsilon, y}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(z_t, t, \tau_\theta(y)\right)\right\|^2\right]$$
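A hedged sketch of how $L_{LDM}$ might be computed during training (again reusing `T` and `alpha_bars` from above; `encoder`, `text_encoder`, and `unet` are placeholders for $E$, $\tau_\theta$, and $\epsilon_\theta$, not Stable Diffusion's actual API):

```python
import torch
import torch.nn.functional as F

def ldm_loss(encoder, text_encoder, unet, images, captions):
    """L_LDM = E[ || eps - eps_theta(z_t, t, tau_theta(y)) ||^2 ],
    with the diffusion running on latents z = E(x) instead of pixels and the
    UNet conditioned on the text embedding through cross-attention."""
    z0 = encoder(images)                                  # z_0 = E(x): latent representation
    cond = text_encoder(captions)                         # tau_theta(y): text embedding
    t = torch.randint(0, T, (z0.shape[0],))
    eps = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps   # forward diffusion in latent space
    eps_pred = unet(zt, t, cond)                           # conditional noise prediction
    return F.mse_loss(eps_pred, eps)
```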
(Figure: diffusion model framework)
(Figure: Stable Diffusion framework)
Compared with the diffusion model, Stable Diffusion runs the forward and reverse diffusion in the lower-dimensional latent space produced by the encoder and adds text conditioning, which greatly reduces the cost of training and sampling.