[1] uses the loss gap that network memorisation creates between clean and noisy data to tell the two apart. Given a batch of normalised losses $\{l_i \in [0,1]\}_{i=1}^n$, a beta mixture model (BMM) is fitted to the two peaks; an earlier work [3] used a Gaussian mixture model (GMM) instead, and both estimate the parameters with the EM algorithm.
With [1] as the running example, these are my notes on the EM algorithm, following [4-6].
[1] needs to decide from the loss value whether an image-text pair is aligned (i.e. clean) or partially-/mis-aligned (i.e. noisy). Concretely, it first computes the normalised losses of a batch of pairs; their distribution (shown in the figure above) has two peaks, each fitted by one beta component, so the overall density is

$$
\begin{aligned}
p(l) &= \sum_{k=1}^{K=2} p(z=k) \cdot p(l \mid z=k) \\
&= \sum_k^K \lambda_k \cdot p(l \mid k) \\
&= \sum_k^K \lambda_k \cdot \mathrm{Beta}(l; \alpha_k, \beta_k),
\end{aligned}
$$

which is eq. (4) of [1]. The latent variable $z \in \{1, 2\}$ indicates which peak $l$ belongs to: 1 for clean, 2 for noisy; see [7] for an introduction to the beta distribution. Once EM has estimated the parameters, $p(z_i = 1 \mid l_i)$ is compared against a threshold $\delta$ to label the $i$-th image-text pair as clean or noisy.
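As a quick illustration of eq. (4) and the thresholding rule, here is a minimal sketch using `scipy.stats.beta`. This is not the code of [1]; the parameter values, the threshold, and function names like `mixture_pdf` / `posterior_clean` are made up for the example.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def mixture_pdf(l, lambdas, alphas, betas):
    """p(l) = sum_k lambda_k * Beta(l; alpha_k, beta_k), eq. (4) in [1]."""
    return sum(lam * beta_dist.pdf(l, a, b)
               for lam, a, b in zip(lambdas, alphas, betas))

def posterior_clean(l, lambdas, alphas, betas):
    """p(z = 1 | l): posterior probability that loss l comes from the clean component."""
    joint_clean = lambdas[0] * beta_dist.pdf(l, alphas[0], betas[0])
    return joint_clean / mixture_pdf(l, lambdas, alphas, betas)

# toy parameters: component 1 = clean (low-loss peak), component 2 = noisy (high-loss peak)
lambdas, alphas, betas = [0.7, 0.3], [2.0, 8.0], [8.0, 2.0]
losses = np.array([0.05, 0.40, 0.90])
delta = 0.5
print(posterior_clean(losses, lambdas, alphas, betas) > delta)  # True for pairs treated as clean
```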
Following [4-6], the log likelihood is

$$
\begin{aligned}
LL &= \sum_{i=1}^n \log p(l_i) \\
&= \sum_i \log \sum_{k=1}^{K=2} p(l_i, z_i = k) \\
&= \sum_i \log \sum_k Q_i(k) \cdot \frac{p(l_i, k)}{Q_i(k)} & (1) \\
&\ge \sum_i \sum_k Q_i(k) \log \frac{p(l_i, k)}{Q_i(k)} & (2)
\end{aligned}
$$

where $Q_i(\cdot)$ is an arbitrary distribution over the components and (2) follows from Jensen's inequality applied to the concave $\log$.
The Jensen lower bound is used because a sum of logs is easier to differentiate than the log of a sum. When estimating the parameters by maximum likelihood we want the bound to be tight, i.e. (1) = (2), so that maximising this lower bound is equivalent to maximising the log likelihood.
One way to get (1) = (2) is to require $\frac{p(l_i,k)}{Q_i(k)} = c$ for a constant $c$: then $(1) = \sum_i \log(\text{expectation of } c) = \sum_i \log c$ and $(2) = \sum_i [\text{expectation of } \log c] = \sum_i \log c$, so (1) = (2). In that case

$$
\begin{aligned}
\frac{p(l_i,k)}{Q_i(k)} &= c & (3) \\
p(l_i,k) &= c \cdot Q_i(k) \\
p(l_i) = \sum_k p(l_i,k) &= c \cdot \sum_k Q_i(k) = c \cdot 1 = c.
\end{aligned}
$$

Substituting back into (3) and rearranging gives $Q_i(k) = \frac{p(l_i,k)}{c} = \frac{p(l_i,k)}{p(l_i)} = p(k \mid l_i)$. In other words, choosing $Q_i(k) = p(k \mid l_i)$ makes (1) = (2), so maximising the lower bound is the same as maximising the log likelihood.
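In code, this E-step choice of $Q_i(k)$ is just Bayes' rule applied to each sample; a minimal sketch (the function name and argument layout are mine, not from [1]):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def e_step(losses, lambdas, alphas, betas):
    """E step: Q_i(k) = p(z_i = k | l_i) under the current parameters."""
    # joint p(l_i, k) = lambda_k * Beta(l_i; alpha_k, beta_k), one column per component k
    joint = np.stack([lam * beta_dist.pdf(losses, a, b)
                      for lam, a, b in zip(lambdas, alphas, betas)], axis=1)
    return joint / joint.sum(axis=1, keepdims=True)  # normalise over k -> p(k | l_i)
```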
With that in place, the EM algorithm can begin:
The E and M steps are iterated several times. In the M step, $\lambda_k$ is obtained with a Lagrange multiplier, following [6], under the constraint $\sum_k \lambda_k = 1$, so the Lagrangian is

$$
\begin{aligned}
L &= LL + \gamma\Bigl(1 - \sum_k \lambda_k\Bigr) \\
&= \sum_i \sum_k Q_i(k) \log \frac{p(l_i,k)}{Q_i(k)} + \gamma\Bigl(1 - \sum_k \lambda_k\Bigr) \\
&= \sum_i \sum_k Q_i(k) \log p(l_i,k) - \sum_i \sum_k \underbrace{Q_i(k) \log Q_i(k)}_{\text{constant}} + \gamma\Bigl(1 - \sum_k \lambda_k\Bigr) \\
&\propto \sum_i \sum_k Q_i(k) \log p(l_i,k) + \gamma\Bigl(1 - \sum_k \lambda_k\Bigr) \\
&= \sum_i \sum_k Q_i(k) \log \Bigl[\lambda_k \cdot \underbrace{\mathrm{Beta}(l_i; \alpha_k, \beta_k)}_{\text{independent of } \lambda_k}\Bigr] + \gamma\Bigl(1 - \sum_k \lambda_k\Bigr).
\end{aligned}
$$

Taking the partial derivatives and setting them to zero:

$$
\left\{\begin{aligned}
\frac{\partial L}{\partial \lambda_k} &= \frac{1}{\lambda_k}\sum_i Q_i(k) - \gamma &= 0 \\
\frac{\partial L}{\partial \gamma} &= 1 - \sum_k \lambda_k &= 0
\end{aligned}\right.
$$

Summing the first equation over $k$ and using $\sum_k Q_i(k) = 1$ and $\sum_k \lambda_k = 1$ gives

$$
\left\{\begin{aligned}
\gamma &= n \\
\lambda_k &= \frac{1}{n}\sum_i Q_i(k) = \frac{1}{n}\sum_i \frac{\lambda_k \cdot \mathrm{Beta}(l_i; \alpha_k, \beta_k)}{\sum_{k'} \lambda_{k'} \cdot \mathrm{Beta}(l_i; \alpha_{k'}, \beta_{k'})}.
\end{aligned}\right.
$$

As for $\alpha_k, \beta_k$, they are recovered by inverting the mean and variance formulas of the beta distribution (see [7]): for $\mathrm{Beta}(x; \alpha, \beta)$,

$$
\left\{\begin{aligned}
\mu &= \mathbb{E}X &&= \frac{\alpha}{\alpha + \beta} \\
\sigma^2 &= \mathbb{D}X &&= \frac{\alpha \beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}
\end{aligned}\right.
$$

which inverts to

$$
\left\{\begin{aligned}
\alpha &= \mu \left[ \frac{\mu(1 - \mu)}{\sigma^2} - 1 \right] \\
\beta &= \Bigl(\frac{1}{\mu} - 1\Bigr)\alpha
\end{aligned}\right.
$$

where, in the M step, $\mu$ and $\sigma^2$ are the $Q_i(k)$-weighted sample mean and variance of the losses for component $k$.
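Putting the M-step updates into code, again as a sketch under the assumptions above (responsibility-weighted method of moments for $\alpha_k, \beta_k$, $\lambda_k$ as the mean responsibility; all names and the toy data are mine):

```python
import numpy as np
from scipy.stats import beta as beta_dist

def weighted_moments_to_beta(losses, weights):
    """Recover (alpha, beta) from the responsibility-weighted mean and variance
    of the losses, using the inverted moment formulas above."""
    w = weights / weights.sum()
    mu = np.sum(w * losses)
    var = np.sum(w * (losses - mu) ** 2)
    alpha = mu * (mu * (1.0 - mu) / var - 1.0)
    beta_param = (1.0 / mu - 1.0) * alpha
    return alpha, beta_param

def m_step(losses, Q):
    """M step: lambda_k = (1/n) sum_i Q_i(k); (alpha_k, beta_k) by weighted moment matching."""
    lambdas = Q.mean(axis=0)
    alphas, betas = zip(*(weighted_moments_to_beta(losses, Q[:, k]) for k in range(Q.shape[1])))
    return lambdas, np.array(alphas), np.array(betas)

# toy run: losses drawn from two beta components, a few EM iterations
rng = np.random.default_rng(0)
losses = np.concatenate([rng.beta(2, 8, 500), rng.beta(8, 2, 500)])
Q = np.column_stack([1.0 - losses, losses])            # crude initial responsibilities
for _ in range(10):
    lambdas, alphas, betas = m_step(losses, Q)         # M step
    joint = np.stack([lam * beta_dist.pdf(losses, a, b)
                      for lam, a, b in zip(lambdas, alphas, betas)], axis=1)
    Q = joint / joint.sum(axis=1, keepdims=True)       # E step with the new parameters
print(lambdas, alphas, betas)
```

The toy loop simply alternates the M step with a re-computation of the responsibilities (the E step), which is the whole EM iteration.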
This corresponds to `BetaMixture1D` in the code of [1], which calls $Q_i(\cdot)$ the `responsibilities` and calls $p(l_i, z_i = k) = p(k)\,p(l_i \mid k) = \lambda_k\,\mathrm{Beta}(l_i; \alpha_k, \beta_k)$ the `weighted_likelihood`.
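To make the naming correspondence concrete, here is a rough sketch of how `weighted_likelihood` and `responsibilities` map onto the formulas above; this is not the actual `BetaMixture1D` implementation, just the correspondence, with the class and other method names made up:

```python
import numpy as np
from scipy.stats import beta as beta_dist

class BetaMixtureSketch:
    """Sketch of the naming in [1]'s BetaMixture1D; not the original implementation."""
    def __init__(self, lambdas, alphas, betas):
        self.lambdas = np.asarray(lambdas)
        self.alphas = np.asarray(alphas)
        self.betas = np.asarray(betas)

    def weighted_likelihood(self, l, k):
        # p(l_i, z_i = k) = lambda_k * Beta(l_i; alpha_k, beta_k)
        return self.lambdas[k] * beta_dist.pdf(l, self.alphas[k], self.betas[k])

    def mixture_pdf(self, l):
        # p(l_i) = sum_k p(l_i, k)
        return sum(self.weighted_likelihood(l, k) for k in range(len(self.lambdas)))

    def responsibilities(self, l, k):
        # Q_i(k) = p(z_i = k | l_i) = p(l_i, k) / p(l_i)
        return self.weighted_likelihood(l, k) / self.mixture_pdf(l)
```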