Pattern Recognition: The EM Algorithm

1. Density Estimation
Some data of practical interest may not follow any standard probability distribution, and a density estimation algorithm is used to estimate its probability density. In general, a non-standard distribution can be fit by a combination of several Gaussian functions, which is called a Gaussian mixture model (GMM).
Consider an unknown random variable $\bm{z}$, which may be hidden or unobserved. Consider the joint distribution of $\bm{x}^{(i)}$ and $\bm{z}^{(i)}$:
$$p(\bm{x}^{(i)}, \bm{z}^{(i)}) = p(\bm{x}^{(i)} \mid \bm{z}^{(i)})\,p(\bm{z}^{(i)})$$
where $\bm{z}^{(i)} \sim \mathrm{Multinomial}(\bm\phi)$ gives the probability of each of the $k$ Gaussian components, and $\bm{x}^{(i)} \mid \bm{z}^{(i)} = j \sim N(\bm\mu_j, \bm\Sigma_j)$ is the known distribution of the data under the $j$-th Gaussian.
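As a concrete illustration (not part of the original derivation), here is a minimal sketch of this generative process, assuming a two-component GMM in 2D with made-up parameters `phi`, `mu`, `Sigma`:

```python
import numpy as np

rng = np.random.default_rng(0)

phi = np.array([0.3, 0.7])                       # mixing weights p(z = j)
mu = np.array([[0.0, 0.0], [4.0, 4.0]])          # component means mu_j
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])   # component covariances Sigma_j

def sample_gmm(m):
    """First draw z ~ Multinomial(phi), then x | z = j ~ N(mu_j, Sigma_j)."""
    z = rng.choice(len(phi), size=m, p=phi)
    x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])
    return x, z

X, Z = sample_gmm(500)   # X: observed data; Z: the (normally hidden) labels
```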
If $\bm{z}^{(i)}$ were known, maximum likelihood estimation could be used:
$$\ln L(\bm\phi, \bm\mu, \bm\Sigma) = \sum_{i=1}^m \log p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\phi, \bm\mu, \bm\Sigma)$$
which yields
$$\phi_j = \frac{1}{m}\sum_{i=1}^m 1\{z^{(i)} = j\}, \qquad \bm\mu_j = \frac{\sum_{i=1}^m 1\{z^{(i)} = j\}\,\bm{x}^{(i)}}{\sum_{i=1}^m 1\{z^{(i)} = j\}}$$
the same estimates as in Gaussian discriminant analysis (GDA). However, $\bm{z}^{(i)}$ is unknown, so we can instead use the model to guess the values of $\bm{z}^{(i)}$, fit better parameter values by maximum likelihood, guess $\bm{z}^{(i)}$ again, and iterate. This algorithm is called the expectation-maximization (EM) algorithm.
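For comparison, a sketch of the complete-data MLE when $z^{(i)}$ is observed, reusing the hypothetical `X, Z` from the sampler above; it assumes every component receives at least one point:

```python
import numpy as np

def mle_known_z(X, Z, k):
    """Complete-data MLE: per-component frequencies, means, and covariances."""
    phi = np.array([(Z == j).mean() for j in range(k)])        # phi_j
    mu = np.array([X[Z == j].mean(axis=0) for j in range(k)])  # mu_j
    Sigma = np.array([np.cov(X[Z == j].T, bias=True) for j in range(k)])
    return phi, mu, Sigma

phi_hat, mu_hat, Sigma_hat = mle_known_z(X, Z, k=2)
```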


2. The EM Algorithm
The EM algorithm proceeds as follows:
(1) Guess the values of the unknown $\bm{z}^{(i)}$; this is the E-step.
(2) Estimate the parameter values by maximum likelihood; this is the M-step.
(3) Iterate (1)-(2) until convergence; a minimal loop sketch follows below.
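A minimal loop skeleton, assuming `e_step` and `m_step` are implemented as sketched in sections 2.1 and 2.2 below; convergence is checked on the log-likelihood, which EM never decreases:

```python
import numpy as np

def em(X, phi, mu, Sigma, tol=1e-6, max_iter=200):
    prev_ll = -np.inf
    for _ in range(max_iter):
        W, ll = e_step(X, phi, mu, Sigma)   # E-step: responsibilities w_j^(i)
        phi, mu, Sigma = m_step(X, W)       # M-step: refit the parameters
        if ll - prev_ll < tol:              # stop once the gain is negligible
            break
        prev_ll = ll
    return phi, mu, Sigma
```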

2.1 E-step
The E-step guesses the values of the unknown $\bm{z}^{(i)}$:
$$w_j^{(i)} = p(\bm{z}^{(i)} = j \mid \bm{x}^{(i)}; \bm\phi, \bm\mu, \bm\Sigma)$$
This step computes the probability that $\bm{z}^{(i)} = j$; that is, $w_j^{(i)}$ is the probability that $\bm{x}^{(i)}$ was generated by the $j$-th Gaussian, relative to all of the Gaussians. By Bayes' rule,
$$p(B_i \mid A) = \frac{P(B_i)\,P(A \mid B_i)}{\sum_{j=1}^n P(B_j)\,P(A \mid B_j)}$$
so that
$$w_j^{(i)} = \frac{p(\bm{x}^{(i)} \mid \bm{z}^{(i)} = j)\,p(\bm{z}^{(i)} = j)}{\sum_{l=1}^k p(\bm{x}^{(i)} \mid \bm{z}^{(i)} = l)\,p(\bm{z}^{(i)} = l)}$$
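A sketch of this E-step for the GMM, assuming SciPy is available; the hypothetical `e_step` returns the responsibility matrix $W$ (one row per point, one column per component) together with the log-likelihood used by the loop skeleton above:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, phi, mu, Sigma):
    """Responsibilities w_j^(i) = p(z^(i) = j | x^(i)) via Bayes' rule."""
    k = len(phi)
    # numerator of Bayes' rule: p(x^(i) | z = j) p(z = j), shape (m, k)
    num = np.column_stack([
        phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
        for j in range(k)
    ])
    W = num / num.sum(axis=1, keepdims=True)  # normalize over components
    ll = np.log(num.sum(axis=1)).sum()        # log-likelihood of the data
    return W, ll
```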
2.2 M-step
The M-step updates the parameters:
$$\phi_j = \frac{1}{m}\sum_{i=1}^m w_j^{(i)}, \qquad \bm\mu_j = \frac{\sum_{i=1}^m w_j^{(i)}\,\bm{x}^{(i)}}{\sum_{i=1}^m w_j^{(i)}}, \qquad \bm\Sigma_j = \frac{\sum_{i=1}^m w_j^{(i)}\,(\bm{x}^{(i)} - \bm\mu_j)(\bm{x}^{(i)} - \bm\mu_j)^T}{\sum_{i=1}^m w_j^{(i)}}$$
The only difference from the known-$\bm{z}^{(i)}$ case is that the indicator of whether the $i$-th point was generated by the $j$-th Gaussian is replaced by a probability; this corresponds exactly to the distinction between $\bm{z}^{(i)}$ being known and unknown.
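A matching M-step sketch: the indicator of the known-$z$ estimates is simply replaced by the responsibility weights $W$ from the E-step:

```python
import numpy as np

def m_step(X, W):
    """Weighted re-estimation: 1{z^(i) = j} is replaced by w_j^(i)."""
    m, k = W.shape
    Nj = W.sum(axis=0)               # effective number of points per component
    phi = Nj / m                     # phi_j = (1/m) sum_i w_j^(i)
    mu = (W.T @ X) / Nj[:, None]     # weighted means
    Sigma = np.array([
        ((X - mu[j]).T * W[:, j]) @ (X - mu[j]) / Nj[j]  # weighted covariances
        for j in range(k)
    ])
    return phi, mu, Sigma
```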


3. The General Form of the EM Algorithm
The GMM is only one special case of the EM algorithm; this section presents its general form.

3.1 Jensen's Inequality
We first introduce Jensen's inequality. Consider a convex function $f$ and a random variable $\bm{x}$; then
$$f(E[\bm{x}]) \le E[f(\bm{x})]$$
Intuitively, a chord of a convex function lies above the function's graph. If $f$ is strictly convex, i.e. $d^2 f / dx^2 > 0$, then equality holds if and only if $\bm{x} = E[\bm{x}]$ with probability 1, i.e. $\bm{x}$ takes some constant value.
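A quick numerical sanity check of the concave case $\log(EX) \ge E[\log X]$, using $X \sim \mathrm{Exp}(1)$ as an arbitrary positive random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)   # positive samples with E[X] = 1
print(np.log(x.mean()))    # log(E[X]): roughly 0
print(np.log(x).mean())    # E[log X]: roughly -0.577, strictly smaller
```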

3.2 A Lower Bound on the Likelihood
Consider a model with distribution $p(\bm{x}, \bm{z}; \bm\theta)$ in which only $\bm{x}$ can be observed, and attempt to maximize the likelihood
$$\ln L(\bm\theta) = \sum_{i=1}^m \log p(\bm{x}^{(i)}; \bm\theta) = \sum_{i=1}^m \log \sum_{\bm{z}^{(i)}} p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)$$
The EM algorithm is a maximum-likelihood procedure, but it does not solve for the maximum by setting the derivative of the likelihood to zero directly; instead, it iteratively constructs a lower bound on the likelihood and maximizes that bound, thereby approaching the maximum of the likelihood itself. Since $\bm{z}$ is unobserved, the likelihood is a function of $\bm{x}$ only, so
$$\max_{\bm\theta} \ln L(\bm\theta) = \max_{\bm\theta} \sum_{i=1}^m \log p(\bm{x}^{(i)}; \bm\theta) = \max_{\bm\theta} \sum_{i=1}^m \log \sum_{\bm{z}^{(i)}} p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)$$
Introduce an auxiliary function $Q_i(\bm{z}^{(i)})$, a probability distribution over the unknown random variable $\bm{z}^{(i)}$, satisfying
$$Q_i(\bm{z}^{(i)}) \ge 0, \qquad \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) = 1$$
Recalling the expectation formula $EX = \sum_{k=1}^m X_k p_k$, we have
$$\sum_{i=1}^m \log \sum_{\bm{z}^{(i)}} p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta) = \sum_{i=1}^m \log \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)})\,\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})} = \sum_{i=1}^m \log E\!\left[\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})}\right]$$
Here the inner sum multiplies a function of $\bm{z}^{(i)}$ by its probability under $Q_i$ and sums over all $\bm{z}^{(i)}$, which is exactly the expectation of that function.
The logarithm $\log(x)$ is a concave function, so by Jensen's inequality $\log(EX) \ge E[\log(X)]$, and therefore
$$\sum_{i=1}^m \log E\!\left[\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})}\right] \ge \sum_{i=1}^m E\!\left[\log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})}\right] = \sum_{i=1}^m \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) \log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})}$$
that is,
$$\ln L(\bm\theta) \ge \sum_{i=1}^m \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) \log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})}$$
To make the bound as good as possible, i.e. to attain equality
$$\sum_{i=1}^m \log E\!\left[\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})}\right] = \sum_{i=1}^m E\!\left[\log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})}\right]$$
for any data, we must choose the distribution $Q$ appropriately. By the equality condition of Jensen's inequality, we need
$$\frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})} = \text{const}$$
i.e. the ratio takes the same value for every $\bm{z}^{(i)}$, so $Q_i(\bm{z}^{(i)}) \propto p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)$. Combined with $\sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) = 1$, this gives
$$Q_i(\bm{z}^{(i)}) = \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{\sum_{\bm{z}^{(i)}} p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)} = \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{p(\bm{x}^{(i)}; \bm\theta)} = p(\bm{z}^{(i)} \mid \bm{x}^{(i)}; \bm\theta)$$
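As a numerical sketch of this choice (using the hypothetical `e_step` from section 2.1 and current GMM parameters `phi`, `mu`, `Sigma`): when $Q_i$ is the posterior, the lower bound evaluates to exactly the log-likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

def elbo(X, W, phi, mu, Sigma):
    """sum_i sum_j Q_i(j) log( p(x^(i), z = j; theta) / Q_i(j) )."""
    k = len(phi)
    joint = np.column_stack([
        phi[j] * multivariate_normal.pdf(X, mean=mu[j], cov=Sigma[j])
        for j in range(k)
    ])                               # p(x^(i), z^(i) = j; theta), shape (m, k)
    # assumes W has no exact zeros, which holds numerically for Gaussians
    return np.sum(W * np.log(joint / W))

# W, ll = e_step(X, phi, mu, Sigma)                   # Q_i = posterior
# assert np.isclose(elbo(X, W, phi, mu, Sigma), ll)   # the bound is tight
```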

3.3 EM Algorithm Theory
In the E-step, guessing the values of $\bm{z}^{(i)}$ means setting
$$Q_i(\bm{z}^{(i)}) = p(\bm{z}^{(i)} \mid \bm{x}^{(i)}; \bm\theta)$$
which fixes a tight lower bound on the likelihood at the current parameters $\bm\theta$.
In the M-step, updating $\bm\theta$ means solving
$$\bm\theta = \arg\max_{\bm\theta}\ \sum_{i=1}^m \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) \log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})}$$
which finds the maximum of the current tight lower bound of the likelihood, thereby updating the parameters $\bm\theta$.
From a convex-optimization viewpoint, consider
$$J(\bm\theta, Q) = \sum_{i=1}^m \sum_{\bm{z}^{(i)}} Q_i(\bm{z}^{(i)}) \log \frac{p(\bm{x}^{(i)}, \bm{z}^{(i)}; \bm\theta)}{Q_i(\bm{z}^{(i)})}$$
where the E-step chooses $Q$ to optimize $J$ and the M-step chooses $\bm\theta$ to optimize $J$; this is a form of coordinate ascent.
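Putting the sketches together, a hypothetical end-to-end run on the data sampled in section 1; each iteration alternates the two coordinate updates of $J$:

```python
import numpy as np

# hypothetical initialization: uniform weights, k random points as means,
# and the overall sample covariance for every component
rng = np.random.default_rng(1)
k = 2
phi0 = np.ones(k) / k
mu0 = X[rng.choice(len(X), size=k, replace=False)]
Sigma0 = np.array([np.cov(X.T, bias=True) for _ in range(k)])

phi_hat, mu_hat, Sigma_hat = em(X, phi0, mu0, Sigma0)
print(phi_hat)  # close to the true weights [0.3, 0.7], up to label permutation
```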
