[Study Notes] Probabilistic Diffusion Model

The core content of this article is not original: most of it is excerpted from the video tutorial it is based on, so please see the original video for the detailed explanation. On top of that, I have added derivations and proofs for some of the formulas. These notes are for study reference only; if anything is unclear, feel free to point it out in the comments.

I. Conditional Probability Formulas and the KL Divergence of Gaussians

1. General form of conditional probability

P(A,B,C) = P(C|B,A) P(B,A) = P(C|B,A) P(B|A) P(A)

P(B,C|A) = \frac{P(A,B,C)}{P(A)} = \frac{P(B,A)}{P(A)} \cdot \frac{P(A,B,C)}{P(B,A)} = P(B|A) P(C|B,A)

2. Conditional probability under the Markov assumption

If A, B, and C satisfy the Markov relation A \rightarrow B \rightarrow C, then

P(A,B,C) = P(C|B,A) P(B|A) P(A) = P(C|B) P(B|A) P(A)

P(B,C|A) = P(B|A) P(C|B,A) = P(B|A) P(C|B)
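As a quick sanity check, this factorization can be verified numerically on a tiny discrete chain. A minimal sketch (the binary variables and transition tables below are made up for illustration):

```python
import numpy as np

# A toy binary Markov chain A -> B -> C (transition tables are made up).
p_a = np.array([0.3, 0.7])                     # P(A)
p_b_a = np.array([[0.9, 0.1], [0.2, 0.8]])     # P(B|A), rows indexed by A
p_c_b = np.array([[0.6, 0.4], [0.5, 0.5]])     # P(C|B), rows indexed by B

# Joint from the factorization P(A, B, C) = P(A) P(B|A) P(C|B).
joint = p_a[:, None, None] * p_b_a[:, :, None] * p_c_b[None, :, :]
assert np.isclose(joint.sum(), 1.0)            # a valid joint distribution

# P(C|B, A) recovered from the joint equals P(C|B) for every value of A.
p_c_ba = joint / joint.sum(axis=2, keepdims=True)
assert np.allclose(p_c_ba, p_c_b[None])
print("Markov factorization verified")
```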

3. KL divergence of Gaussian distributions

For two (continuous) random variables, the KL divergence is defined as:

D_{KL}(P||Q) = \int_{-\infty}^{\infty} p(x) log\frac{p(x)}{q(x)}dx

Let P \sim N(\mu_1, \sigma_1^2) and Q \sim N(\mu_2, \sigma_2^2); then:

D_{KL}({P||Q}) = log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}

Proof:

p(x) = (2\pi \sigma_1^2)^{-\frac{1}{2}} exp\{ -\frac{1}{2\sigma_1^2}(x-\mu_1)^2 \}

q(x) = (2\pi \sigma_2^2)^{-\frac{1}{2}} exp\{ -\frac{1}{2\sigma_2^2}(x-\mu_2)^2 \}

\Rightarrow \frac{p(x)}{q(x)} = (\frac{\sigma_1^2}{\sigma_2^2})^{-\frac{1}{2}} exp\{ -[\frac{1}{2\sigma_1^2}(x-\mu_1)^2 - \frac{1}{2\sigma_2^2}(x-\mu_2)^2] \}

log(\frac{p(x)}{q(x)}) = log\frac{\sigma_2}{\sigma_1} - [\frac{1}{2\sigma_1^2}(x-\mu_1)^2 - \frac{1}{2\sigma_2^2}(x-\mu_2)^2]

Since p(x) is the density of a Gaussian with mean \mu_1 and variance \sigma_1^2, it satisfies the following properties:

(1)\int_{-\infty}^{\infty} p(x) dx = 1

(2)\mathbb{E} [P] = \int_{-\infty}^{\infty} x p(x) dx = \mu_1

(3)Var(P) = \int_{-\infty}^{\infty} (x-\mu_1)^2 p(x) dx = \sigma_1^2

(4)\mathbb{E} [P^2] = \int_{-\infty}^{\infty}x^2 p(x) dx = Var(P) + \mathbb{E}^2[P] = \sigma_1^2 + \mu_1^2

The KL divergence then simplifies to:

\begin{align*} D_{KL}(P||Q) &= \int_{-\infty}^{\infty} p(x) log(\frac{\sigma_2}{\sigma_1}) dx - \frac{1}{2\sigma_1^2}\int_{-\infty}^{\infty}(x-\mu_1)^2 p(x) dx + \frac{1}{2\sigma_2^2}\int_{-\infty}^{\infty}(x-\mu_2)^2 p(x) dx \\ &= log(\frac{\sigma_2}{\sigma_1}) - \frac{\sigma_1^2}{2\sigma_1^2} + \frac{1}{2\sigma_2^2}[\int_{-\infty}^{\infty} x^2p(x)dx - 2\mu_2\int_{-\infty}^{\infty} xp(x)dx + \mu_2^2 \int_{-\infty}^{\infty} p(x)dx] \\ &= log(\frac{\sigma_2}{\sigma_1}) - \frac{1}{2} + \frac{1}{2\sigma_2^2}[\sigma_1^2 + \mu_1^2 - 2\mu_1\mu_2 + \mu_2^2] \\ &= log(\frac{\sigma_2}{\sigma_1}) + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2} \end{align*}
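This closed form can be sanity-checked against a Monte Carlo estimate of \int p(x) log\frac{p(x)}{q(x)} dx. A minimal sketch (the parameters are arbitrary):

```python
import numpy as np
from scipy.stats import norm

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu1, s1, mu2, s2 = 0.5, 1.2, -0.3, 0.8
x = norm.rvs(loc=mu1, scale=s1, size=1_000_000, random_state=0)  # x ~ P
mc = np.mean(norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2))  # E_P[log p/q]
print(kl_gauss(mu1, s1, mu2, s2), mc)  # the two values should agree closely
```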

4. Reparameterization

We want to learn the mean and variance of a Gaussian, but the sampling operation is not differentiable with respect to \mu and \sigma^2, so we cannot sample directly from N(\mu, \sigma^2). Instead, we first draw z \sim N(0,1) from the standard normal and set x = \sigma \cdot z + \mu; then x \sim N(\mu, \sigma^2), and the whole process is differentiable with respect to \mu and \sigma^2.
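A minimal PyTorch sketch of the trick (the variable names and the downstream loss are illustrative): the randomness enters only through z, so gradients flow to mu and log_var.

```python
import torch

mu = torch.tensor([0.5], requires_grad=True)
log_var = torch.tensor([-1.0], requires_grad=True)  # learn log(sigma^2) for stability

z = torch.randn_like(mu)               # z ~ N(0, 1); the randomness carries no parameters
x = mu + torch.exp(0.5 * log_var) * z  # x ~ N(mu, sigma^2), differentiable in mu, log_var

loss = (x ** 2).mean()                 # any downstream loss
loss.backward()
print(mu.grad, log_var.grad)           # both gradients are populated
```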

II. VAE and Multi-Layer VAE

1. Single-layer VAE: formulation and variational lower bound

[Figure 1: single-layer VAE]

\begin{align*} p(x) &= \int_z p_{\theta}(x|z) p(z) dz \\ &= \int_z q_{\phi}(z|x) \frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)}dz \\ &= \mathbb{E}_{z \sim q_{\phi}(z|x)}[\frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)}] \end{align*}

\begin{align*} log(p(x)) &= log(\mathbb{E}_{z \sim q_{\phi}(z|x)}[\frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)}]) \\ &\geq \mathbb{E}_{z \sim q_{\phi}(z|x)}[log(\frac{p_{\theta}(x|z)p(z)}{q_{\phi}(z|x)})] \text{ (by Jensen's inequality)} \end{align*}
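To make the bound concrete, here is a minimal sketch of a one-sample ELBO estimate with a Gaussian encoder q_{\phi}(z|x), a Bernoulli decoder p_{\theta}(x|z), and a standard-normal prior p(z). The layer shapes and latent size are illustrative assumptions, and the KL term applies the closed form from Part I per latent dimension:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(784, 2 * 16)  # encoder head: outputs mu and log_var of q(z|x), latent dim 16
dec = nn.Linear(16, 784)      # decoder head: outputs Bernoulli logits of p(x|z)

def elbo(x):
    mu, log_var = enc(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterized sample
    # E_q[log p(x|z)], estimated with the single sample z (Bernoulli likelihood)
    rec = -F.binary_cross_entropy_with_logits(dec(z), x, reduction='none').sum(-1)
    # KL(q(z|x) || N(0, I)), closed form per dimension, summed over latent dims
    kl = 0.5 * (mu**2 + log_var.exp() - 1 - log_var).sum(-1)
    return (rec - kl).mean()  # a lower bound on log p(x), to be maximized

x = torch.rand(8, 784)        # dummy batch with values in [0, 1]
print(elbo(x))
```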

2. Multi-layer VAE: formulation and variational lower bound

[Figure 2: two-layer (hierarchical) VAE]

\begin{align*} p(x) &= \int_{z_1} \int_{z_2} p(x, z_1, z_2) dz_1 dz_2 \\ &= \int_{z_1} \int_{z_2} q_{\phi}(z_1, z_2|x)\frac{p(x, z_1, z_2)}{q_{\phi}(z_1, z_2|x)} dz_1 dz_2 \\ &= \mathbb{E}_{z_1, z_2 \sim q_{\phi}(z_1, z_2|x)}[\frac{p(x, z_1, z_2)}{q_{\phi}(z_1, z_2|x)}] \end{align*}

\begin{align*} log(p(x)) &= log(\mathbb{E}_{z_1, z_2 \sim q_{\phi}(z_1, z_2|x)}[\frac{p(x, z_1, z_2)}{q_{\phi}(z_1, z_2|x)}]) \\ &\geq \mathbb{E}_{z_1, z_2 \sim q_{\phi}(z_1, z_2|x)}[log(\frac{p(x, z_1, z_2)}{q_{\phi}(z_1, z_2|x)})] \end{align*}

Under the Markov assumption:

p(x, z_1, z_2) = p(x|z_1) p(z_1|z_2)p(z_2)

q(z_1, z_2|x) = q(z_1|x) q(z_2|z_1)

Therefore,

L(\theta, \phi) = \mathbb{E}_{q(z_1, z_2|x)}[log p(x|z_1) - log q(z_1|x) + log p(z_1|z_2) - log q(z_2|z_1) + log p(z_2)]

III. Diffusion Model Illustration

[Figure 3: diffusion model diagram]

Reading right to left is the diffusion process (adding noise); reading left to right is the reverse process (removing noise).

IV. The Diffusion Process

1. Given an initial data distribution, we can repeatedly add Gaussian noise to it. The standard deviation of the noise is fixed by \beta_t, and the mean is determined by the fixed value \beta_t and the data x_{t-1} from the previous step. This process is a Markov chain.

2. As t grows, the data distribution x_T eventually becomes an isotropic Gaussian distribution.

q(x_t|x_{t-1}) = N(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)

q(x_{1:T}|x_0) = \prod_{t=1}^T q(x_t|x_{t-1})

3. The distribution q(x_t|x_0) at any time step t can be computed directly from x_0 and the \beta_t schedule, with no iteration required.

Note: if X \sim N(\mu_1, \sigma_1^2) and Y \sim N(\mu_2, \sigma_2^2) are independent, then

\mathbb{E}[aX+bY] = a\mu_1 + b\mu_2, \quad Var(aX+bY) = a^2\sigma_1^2 + b^2 \sigma_2^2

Applying this to the two noise terms \sqrt{\alpha_t - \alpha_t \alpha_{t-1}} z_{t-2} and \sqrt{1-\alpha_t}z_{t-1} that appear in the derivation below:

\sqrt{\alpha_t(1-\alpha_{t-1})}z_{t-2} + \sqrt{1-\alpha_t} z_{t-1} \sim N(0, (1-\alpha_t \alpha_{t-1})I), so the sum has the same distribution as \sqrt{1 - \alpha_t \alpha_{t-1}} z with z \sim N(0, I).
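This merging step is easy to check numerically; a minimal sketch (the \alpha values are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
a_t, a_prev = 0.9, 0.8                       # stand-ins for alpha_t and alpha_{t-1}
z1, z2 = rng.standard_normal((2, 1_000_000))

mix = np.sqrt(a_t * (1 - a_prev)) * z1 + np.sqrt(1 - a_t) * z2
print(mix.std(), np.sqrt(1 - a_t * a_prev))  # empirical and predicted std agree
```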

Let \alpha_t = 1 - \beta_t and \bar{\alpha}_t = \prod_{k=1}^t \alpha_k (all z_i \sim N(0, I) below are independent). Then:

\begin{align*} x_t &= \sqrt{\alpha_t} x_{t-1} + \sqrt{1 - \alpha_t} z_{t-1} \\ &= \sqrt{\alpha_t} (\sqrt{\alpha_{t-1}}x_{t-2} + \sqrt{1 - \alpha_{t-1}}z_{t-2}) + \sqrt{1 - \alpha_t}z_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{\alpha_t - \alpha_t \alpha_{t-1}} z_{t-2} + \sqrt{1 - \alpha_t} z_{t-1} \\ &= \sqrt{\alpha_t \alpha_{t-1}} x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}z \\ &= \dots \dots \\ &= \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}z \end{align*}

\Rightarrow q(x_t|x_0) = N(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)I)

Usually one sets \beta_1 < \dots < \beta_T, i.e. the variance of the added noise grows over time.
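A minimal numpy sketch of the closed-form jump x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}z; the linear schedule and T = 1000 are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # increasing schedule, beta_1 < ... < beta_T
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)     # \bar{alpha}_t = prod_{k<=t} alpha_k

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) in one step, with no iteration over 1..t."""
    z = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * z

x0 = np.ones(4)                     # toy "data"
print(q_sample(x0, t=10))           # still close to x0
print(q_sample(x0, t=T - 1))        # nearly pure standard Gaussian noise
```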

V. The Reverse Process

The reverse process uses a model to recover the original data from Gaussian noise:

p_{\theta}(x_{0:T}) = p(x_T) \prod_{t=1}^T p_{\theta}(x_{t-1}|x_t)

p_{\theta}(x_{t-1}|x_t) = N(x_{t-1}; \mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t))

VI. The Posterior Diffusion Conditional Probability

The distribution q(x_{t-1}|x_t, x_0) has a closed form: given x_t and x_0, we can compute the distribution of x_{t-1}.

q(x_{t-1}|x_t, x_0) = N(x_{t-1}; \tilde{\mu}(x_t, x_0), \tilde{\beta}_t I)

q(x_{t-1}|x_t, x_0) = \frac{q(x_{t-1}, x_t | x_0)}{q(x_t | x_0)} = \frac{q(x_t | x_{t-1}, x_0) q(x_{t-1}|x_0)}{q(x_t | x_0)}

Note:

\begin{align*} &q(x_t | x_{t-1}) = N (x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I) \\ &q(x_t|x_0) = N(x_t; \sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I) \end{align*}

Therefore,

\begin{align*}q(x_{t-1}|x_t, x_0) &\propto exp\{ -\frac{1}{2}[\frac{(x_t - \sqrt{\alpha_t} x_{t-1})^2}{\beta_t} + \frac{(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}x_0)^2}{1 - \bar{\alpha}_{t-1}} - \frac{(x_t - \sqrt{\bar{\alpha}_t}x_0)^2}{1 - \bar{\alpha}_t}] \} \\ &=exp\{ -\frac{1}{2} [(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) x_{t-1}^2 - (\frac{2\sqrt{\alpha_t}}{\beta_t}x_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}x_0)x_{t-1} + C(x_t, x_0)] \} \end{align*}

Here C(x_t, x_0) is a constant that does not depend on x_{t-1}. Matching the quadratic in x_{t-1} against a Gaussian density then yields the posterior variance and mean:

\tilde{\beta}_t = \frac{1}{\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}} = \frac{1}{\frac{\alpha_t - \bar{\alpha}_t + \beta_t}{\beta_t (1 - \bar{\alpha}_{t-1})}} = \frac{1}{\frac{1 - \bar{\alpha}_t}{\beta_t (1 - \bar{\alpha}_{t-1})}} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t

\begin{align*} \tilde{\mu}_t (x_t, x_0) &= (\frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}x_0) / (\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) \\ &= (\frac{\sqrt{\alpha_t}}{\beta_t} x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}x_0) \cdot \frac{\beta_t (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \\ &= \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 \end{align*}

From x_t = \sqrt{1 - \bar{\alpha}_t} z_t + \sqrt{\bar{\alpha}_t} x_0 derived earlier, we have x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} (x_t - \sqrt{1 - \bar{\alpha}_t}z_t). Substituting this expression for x_0 into the posterior q(x_{t-1} | x_t, x_0) gives a new form of the mean: conditioned on x_0, the mean of the posterior Gaussian depends only on x_t and z_t.

\begin{align*} \tilde{\mu}_t &= \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \cdot \frac{1}{\sqrt{\bar{\alpha}_t}} (x_t - \sqrt{1 - \bar{\alpha}_t} z_t) \\ &= [\frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} + \frac{\beta_t}{\sqrt{\alpha_t}(1 - \bar{\alpha}_t)}] x_t - \frac{\beta_t \sqrt{1 - \bar{\alpha}_t}}{(1 - \bar{\alpha}_t)\sqrt{\alpha_t}}z_t \\ &= \frac{1}{\sqrt{\alpha_t}} (x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} z_t) \end{align*}
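In code, the posterior parameters follow directly from the noise schedule. A minimal sketch (same illustrative schedule as in Part IV; t is zero-indexed):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # same illustrative schedule as in Part IV
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_params(x_t, x_0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0); t is zero-indexed."""
    a_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    beta_tilde = (1.0 - a_bar_prev) / (1.0 - alpha_bars[t]) * betas[t]
    mu_tilde = (np.sqrt(alphas[t]) * (1.0 - a_bar_prev) / (1.0 - alpha_bars[t]) * x_t
                + np.sqrt(a_bar_prev) * betas[t] / (1.0 - alpha_bars[t]) * x_0)
    return mu_tilde, beta_tilde
```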

VII. Likelihood of the Target Data Distribution

Adding a (non-negative) KL divergence to the negative log-likelihood yields an upper bound on it:

\begin{align*} -log(p_{\theta}(x_0)) &\leq -log(p_{\theta}(x_0)) + D_{KL}(q(x_{1:T}|x_0)||p_{\theta}(x_{1:T}|x_0)) \\ &= -log(p_{\theta}(x_0)) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}[log \frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})/p_{\theta}(x_0)}] \\ &= -log(p_{\theta}(x_0)) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}[log \frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})} + log(p_{\theta}(x_0))] \\ &= \mathbb{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}[log \frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}] \end{align*}

From the above:

-\mathbb{E}_{q(x_0)}[log(p_{\theta}(x_0))] \leq \mathbb{E}_{q(x_{0:T})}[log\frac{q(x_{1:T}|x_0)}{p_{\theta}(x_{0:T})}] \triangleq L_{VLB}

The left-hand side is the cross entropy between the two distributions, and the right-hand side is its upper bound; minimizing the upper bound minimizes the cross entropy.

Note: q(x_t|x_{t-1}) = q(x_t | x_{t-1}, x_0) = \frac{q(x_t, x_{t-1}|x_0)}{q(x_{t-1}|x_0)} = \frac{q(x_{t-1}|x_t, x_0) q(x_t|x_0)}{q(x_{t-1} | x_0)}

\begin{align*} L_{VLB} &= \mathbb{E}_{q(x_{0:T})}[log\frac{q(x_{1:T} | x_0)}{p_{\theta}(x_{0:T})}] \\ &= \mathbb{E}_{q(x_{0:T})}[log\frac{\prod_{t=1}^Tq(x_t|x_{t-1})}{p_{\theta}(x_T)\prod_{t=1}^Tp_{\theta}(x_{t-1}|x_t)}] \\ &= \mathbb{E}_{q(x_{0:T})}[-log(p_{\theta}(x_T)) + \sum_{t=1}^T log \frac{q(x_t | x_{t-1})}{p_{\theta}(x_{t-1}|x_t)}] \\ &= \mathbb{E}_{q(x_{0:T})}[-log(p_{\theta}(x_T)) + \sum_{t=2}^T log \frac{q(x_t | x_{t-1})}{p_{\theta}(x_{t-1}|x_t)} + log\frac{q(x_1|x_0)}{p_{\theta}(x_0|x_1)}] \\ &= \mathbb{E}_{q(x_{0:T})}[-log(p_{\theta}(x_T)) + \sum_{t=2}^T log (\frac{q(x_{t-1} |x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} \cdot \frac{q(x_t|x_0)}{q(x_{t-1}|x_0)}) + log\frac{q(x_1|x_0)}{p_{\theta}(x_0|x_1)}] \text{ (use the note)} \\ &=\mathbb{E}_{q(x_{0:T})}[-log(p_{\theta}(x_T)) + \sum_{t=2}^T log \frac{q(x_{t-1} |x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} + \sum_{t=2}^T log\frac{q(x_t|x_0)}{q(x_{t-1}|x_0)} + log\frac{q(x_1|x_0)}{p_{\theta}(x_0|x_1)}] \end{align*}

\begin{align*} L_{VLB} &= \mathbb{E}_{q(x_{0:T})}[-log(p_{\theta}(x_T)) + \sum_{t=2}^T log \frac{q(x_{t-1} |x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} + log\frac{q(x_T|x_0)}{q(x_{1}|x_0)} + log\frac{q(x_1|x_0)}{p_{\theta}(x_0|x_1)}] \\ &= \mathbb{E}_{q(x_{0:T})}[log\frac{q(x_T|x_0)}{p_{\theta}(x_T)} + \sum_{t=2}^T log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} - log(p_{\theta}(x_0|x_1))] \\ &= \underbrace{D_{KL}(q(x_T|x_0)||p_{\theta}(x_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_{KL}(q(x_{t-1}|x_t, x_0)||p_{\theta}(x_{t-1}|x_t))}_{L_{t-1}} - \underbrace{\mathbb{E}_{q(x_{0:T})}[log(p_{\theta}(x_0|x_1))]}_{L_0} \end{align*}

The original DDPM paper sets the variance of the reverse distribution p_{\theta}(x_{t-1}|x_t) to a constant determined by the \beta schedule, so the trainable parameters \theta appear only in the mean:

\begin{cases} &q(x_{t-1}|x_t, x_0) = N(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I) \\ &p_{\theta}(x_{t-1}|x_t) = N(x_{t-1}; \mu_{\theta}(x_t, t), \underbrace{\Sigma_{\theta}(x_t, t)}_{constant \ \sigma_t^2 I}) \end{cases}

Substituting these into L_{t-1} and applying the KL divergence formula for two Gaussians proved in Part I gives:

L_{t-1} = \mathbb{E}_{q}[\frac{1}{2\sigma_t^2} ||\tilde{\mu}_t(x_t, x_0) - \mu_{\theta}(x_t, t)||^2] + C

\begin{align*} L_{t-1} - C &= \mathbb{E}_{x_0, \epsilon} [\frac{1}{2\sigma_t^2} ||\tilde{\mu}_t(x_t(x_0, \epsilon),\frac{1}{\sqrt{\bar{\alpha}_t}}(x_t(x_0, \epsilon)-\sqrt{1-\bar{\alpha}_t}\epsilon)) - \mu_{\theta}(x_t(x_0, \epsilon),t)||^2] \\ &= \mathbb{E}_{x_0, \epsilon} [\frac{1}{2\sigma_t^2} ||\frac{1}{\sqrt{\alpha_t}}(x_t(x_0, \epsilon) - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon) - \mu_{\theta}(x_t(x_0, \epsilon), t)||^2] \end{align*}

Here the modeling choice is to have the denoising network \epsilon_{\theta} predict the noise \epsilon:

\mu_{\theta}(x_t, t) = \tilde{\mu}_t(x_t, \frac{1}{\sqrt{\bar{\alpha}_t}}(x_t - \sqrt{1 - \bar{\alpha}_t}\epsilon_{\theta}(x_t, t))) = \frac{1}{\sqrt{\alpha_t}}(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\epsilon_{\theta}(x_t, t))

L_{t-1} then simplifies to

\mathbb{E}_{x_0, \epsilon} [ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} ||\epsilon - \epsilon_{\theta}(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t)||^2]

L_{simple}(\theta) = \mathbb{E}_{t, x_0, \epsilon} [||\epsilon - \epsilon_{\theta}(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t)||^2]
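The simplified objective drops the time-dependent weight in front of L_{t-1} and samples t uniformly. Below is a minimal PyTorch sketch of one L_simple training step and one reverse sampling step using the \mu_{\theta} parameterization above; model stands for any noise-prediction network taking (x_t, t), the linear schedule mirrors the illustrative one from Part IV, and \sigma_t^2 = \beta_t is one of the constant variance choices. This is a sketch under those assumptions, not a reference implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # illustrative linear schedule, as in Part IV
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def simple_loss(model, x0):
    """One minibatch of L_simple: regress the network output onto the noise."""
    t = torch.randint(0, T, (x0.shape[0],))                  # uniform timestep (zero-indexed)
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over data dims
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps       # closed-form forward jump
    return ((eps - model(x_t, t)) ** 2).mean()

@torch.no_grad()
def p_sample(model, x_t, t):
    """One reverse step x_t -> x_{t-1}, using mu_theta from the formula above."""
    eps = model(x_t, torch.full((x_t.shape[0],), t))
    mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean                                          # no noise at the final step
    return mean + betas[t].sqrt() * torch.randn_like(x_t)    # sigma_t^2 = beta_t
```

Generation then starts from x_T \sim N(0, I) and applies p_sample for t = T-1, \dots, 0.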
