This note first introduces the Markov model and how its parameters are learned, then moves on to the hidden Markov model, covering how to compute the probability of an observation sequence and how to find the most likely hidden state sequence given an observation sequence, and finally presents how the hidden Markov model itself is learned.
Markov Model
State set: $\mathcal{S} = \{s_1,\cdots,s_N\}$
Observed state sequence: $x = x_1,\cdots,x_t,\cdots,x_T$, where $x_t \in \mathcal{S}$
Initial state probability: $\pi_i = p(x_1 = s_i),\ 1 \leq i \leq N$
State transition probability: $a_{ij} = p(x_t = s_j \mid x_{t-1} = s_i),\ 1 \leq i,j \leq N$
Probability of an observed state sequence (assuming that the current state $x_t$ depends only on the previous state $x_{t-1}$, i.e., the first-order Markov assumption):
$$\begin{aligned} P(x;\theta) &= \prod_{t=1}^T p(x_t \mid x_1,\cdots,x_{t-1}) \\ &\approx p(x_1) \times \prod_{t=2}^T p(x_t \mid x_{t-1}) \end{aligned}$$
where $\theta = \{p(x) \mid x \in \mathcal{S}\} \cup \{p(x' \mid x) \mid x, x' \in \mathcal{S}\}$
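To make the factorization concrete, here is a minimal sketch that evaluates $P(x;\theta)$ for a toy two-state weather chain; the state names and all probability values below are assumptions made up for illustration, not values from the text.

```python
# Minimal sketch of P(x; theta) for a first-order Markov chain.
# The two-state weather chain and all probabilities are assumed for illustration.
S = ["sunny", "rainy"]
pi = {"sunny": 0.6, "rainy": 0.4}                # initial probabilities pi_i
A = {"sunny": {"sunny": 0.8, "rainy": 0.2},      # transition probabilities a_ij
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def sequence_probability(x):
    """P(x) = p(x_1) * prod_{t>=2} p(x_t | x_{t-1})."""
    p = pi[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= A[prev][cur]
    return p

print(sequence_probability(["sunny", "sunny", "rainy"]))  # 0.6 * 0.8 * 0.2 = 0.096
```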
Learning the Model
Goal: learn the model parameters $\theta$, i.e., the initial state probabilities and the state transition probabilities, via maximum likelihood estimation.
Assume the training set contains $D$ samples, $\mathscr{D} = \{x^{(d)}\}_{d=1}^D$. Maximum likelihood estimation obtains the optimal model parameters automatically from the training data:
$$\hat{\theta} = \arg\max_{\theta}\{L(\theta)\}$$
The log-likelihood is:
$$\begin{aligned} L(\theta) &= \sum_{d=1}^D \log P(x^{(d)};\theta) \\ &= \sum_{d=1}^D \left( \log p(x_1^{(d)}) + \sum_{t=2}^{T^{(d)}} \log p(x_t^{(d)} \mid x_{t-1}^{(d)}) \right) \end{aligned}$$
where $T^{(d)}$ is the length of the $d$-th training sequence. The model parameters must also satisfy the following two constraints:
$$\sum_{x \in \mathcal{S}} p(x) = 1, \qquad \forall x: \sum_{x' \in \mathcal{S}} p(x' \mid x) = 1$$
Introduce Lagrange multipliers to solve this constrained maximization:
$$J(\theta,\lambda,\gamma) = L(\theta) - \lambda\left(\sum_{x \in \mathcal{S}} p(x) - 1 \right) - \sum_{x \in \mathcal{S}} \gamma_x \left( \sum_{x' \in \mathcal{S}} p(x' \mid x) - 1\right)$$
where $\lambda$ is the Lagrange multiplier for the initial state probability constraint and $\gamma = \{\gamma_x \mid x \in \mathcal{S}\}$ is the set of multipliers for the transition probability constraints.
- First, take the partial derivative with respect to the initial state probability $p(x)$:
$$\begin{aligned} \frac{\partial J(\theta,\lambda,\gamma)}{\partial p(x)} &= \frac{\partial}{\partial p(x)} \left( \sum_{d=1}^D \left( \log p(x_1^{(d)}) + \sum_{t=2}^{T^{(d)}} \log p(x_t^{(d)} \mid x_{t-1}^{(d)}) \right) - \lambda\left(\sum_{x \in \mathcal{S}} p(x) - 1 \right) - \sum_{x \in \mathcal{S}} \gamma_x \left( \sum_{x' \in \mathcal{S}} p(x' \mid x) - 1\right) \right) \\ &= \frac{\partial}{\partial p(x)} \sum_{d=1}^D \log p(x_1^{(d)}) - \lambda \\ &= \frac{1}{p(x)} \sum_{d=1}^D \delta(x_1^{(d)},x) - \lambda \end{aligned}$$
where $\delta(a,b)$ equals 1 when $a = b$ and 0 otherwise.
Let $c(x,\mathscr{D})$ denote the number of training sequences whose first state is $x$:
$$c(x,\mathscr{D}) = \sum_{d=1}^D \delta(x_1^{(d)},x)$$
Next, take the partial derivative with respect to the Lagrange multiplier $\lambda$:
$$\frac{\partial J(\theta,\lambda,\gamma)}{\partial \lambda} = \sum_{x \in \mathcal{S}} p(x) - 1$$
Setting these derivatives to zero gives:
$$p(x) = \frac{c(x,\mathscr{D})}{\lambda}$$
Substituting into the normalization constraint yields $\lambda = \sum_{x \in \mathcal{S}} c(x,\mathscr{D})$, so the estimate of the initial state probability is:
$$p(x) = \frac{c(x,\mathscr{D})}{\sum_{x' \in \mathcal{S}} c(x',\mathscr{D})}$$
- Next, take the partial derivative with respect to the transition probability $p(x' \mid x)$:
$$\begin{aligned} \frac{\partial J(\theta,\lambda,\gamma)}{\partial p(x' \mid x)} &= \frac{\partial}{\partial p(x' \mid x)} \left( \sum_{d=1}^D \left( \log p(x_1^{(d)}) + \sum_{t=2}^{T^{(d)}} \log p(x_t^{(d)} \mid x_{t-1}^{(d)}) \right) - \lambda\left(\sum_{x \in \mathcal{S}} p(x) - 1 \right) - \sum_{x \in \mathcal{S}} \gamma_x \left( \sum_{x' \in \mathcal{S}} p(x' \mid x) - 1\right) \right) \\ &= \frac{\partial}{\partial p(x' \mid x)} \sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \log p(x_t^{(d)} \mid x_{t-1}^{(d)}) - \gamma_x \\ &= \frac{1}{p(x' \mid x)} \sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \delta(x_{t-1}^{(d)},x)\, \delta(x_t^{(d)},x') - \gamma_x \end{aligned}$$
Let $c(x,x',\mathscr{D})$ denote the number of times $x'$ immediately follows $x$ in the training data:
$$c(x,x',\mathscr{D}) = \sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \delta(x_{t-1}^{(d)},x)\, \delta(x_t^{(d)},x')$$
Then take the partial derivative with respect to the Lagrange multiplier $\gamma_x$:
$$\frac{\partial J(\theta,\lambda,\gamma)}{\partial \gamma_x} = \sum_{x' \in \mathcal{S}} p(x' \mid x) - 1$$
Setting these to zero gives:
$$p(x' \mid x) = \frac{c(x,x',\mathscr{D})}{\gamma_x}$$
with $\gamma_x = \sum_{x'' \in \mathcal{S}} c(x,x'',\mathscr{D})$, hence:
$$p(x' \mid x) = \frac{c(x,x',\mathscr{D})}{\sum_{x'' \in \mathcal{S}} c(x,x'',\mathscr{D})}$$
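Both closed-form estimates above are just relative frequencies, so they can be sketched with plain counting; the three toy training sequences below are assumptions for illustration.

```python
from collections import Counter

# Sketch of the closed-form MLE: both estimates are relative frequencies.
# The toy training sequences are assumptions for illustration.
data = [["sunny", "sunny", "rainy"],
        ["rainy", "rainy", "sunny", "sunny"],
        ["sunny", "rainy", "rainy"]]

c_init = Counter(seq[0] for seq in data)                               # c(x, D)
c_trans = Counter(pair for seq in data for pair in zip(seq, seq[1:]))  # c(x, x', D)

def initial_prob(x):
    """p(x) = c(x, D) / sum_x' c(x', D)."""
    return c_init[x] / sum(c_init.values())

def transition_prob(x, x2):
    """p(x'|x) = c(x, x', D) / sum_x'' c(x, x'', D)."""
    return c_trans[(x, x2)] / sum(n for (a, _), n in c_trans.items() if a == x)

print(initial_prob("sunny"))             # 2 of the 3 sequences start with "sunny"
print(transition_prob("rainy", "rainy"))
```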
Computing the Observation Probability
Goal: compute the probability of an observation sequence.
An example: sitting in a dark room, you cannot see the weather outside, but you can infer it by touching the floor and feeling how damp it is. Here the dampness of the floor is the observed state and the outside weather is the hidden state.
Observed state set: $\mathscr{O} = \{o_1,\cdots,o_M\}$
Hidden state set: $\mathcal{S} = \{s_1,\cdots,s_N\}$
Observed state sequence: $x = x_1,\cdots,x_t,\cdots,x_T$
Hidden state sequence: $z = z_1,\cdots,z_t,\cdots,z_T$
Initial hidden state probability: $\pi_i = p(z_1 = s_i),\ 1 \leq i \leq N$
Hidden state transition probability: $a_{ij} = p(z_t = s_j \mid z_{t-1} = s_i),\ 1 \leq i,j \leq N$
Observation generation (emission) probability: $b_j(k) = p(x_t = o_k \mid z_t = s_j),\ 1 \leq j \leq N,\ 1 \leq k \leq M$
The hidden Markov model:
$$\begin{aligned} P(x;\theta) &= \sum_z P(x,z;\theta) \\ &= \sum_z p(z_1)\, p(x_1 \mid z_1) \prod_{t=2}^T p(z_t \mid z_{t-1})\, p(x_t \mid z_t) \end{aligned}$$
Its parameters are $\theta = \{p(z) \mid z \in \mathcal{S}\} \cup \{p(z' \mid z) \mid z,z' \in \mathcal{S}\} \cup \{p(x \mid z) \mid x \in \mathscr{O},\, z \in \mathcal{S}\}$
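For a tiny model, the marginalization $P(x;\theta) = \sum_z P(x,z;\theta)$ can be evaluated by brute force, enumerating all $N^T$ hidden sequences; the dark-room weather/dampness parameters below are illustrative assumptions.

```python
from itertools import product

# Brute-force evaluation of P(x) = sum_z P(x, z) over all N^T hidden sequences.
# Parameters are illustrative assumptions (dark-room weather example).
S = ["sunny", "rainy"]                                   # hidden states
pi = {"sunny": 0.6, "rainy": 0.4}
A = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}
B = {"sunny": {"dry": 0.9, "damp": 0.1},                 # emission b_j(k)
     "rainy": {"dry": 0.2, "damp": 0.8}}

def joint(x, z):
    """P(x, z) = p(z_1) b(x_1|z_1) * prod_t a(z_t|z_{t-1}) b(x_t|z_t)."""
    p = pi[z[0]] * B[z[0]][x[0]]
    for t in range(1, len(x)):
        p *= A[z[t - 1]][z[t]] * B[z[t]][x[t]]
    return p

def brute_force_likelihood(x):
    return sum(joint(x, z) for z in product(S, repeat=len(x)))

print(brute_force_likelihood(["dry", "damp"]))  # 0.1712
```

The enumeration cost grows as $N^T$, which is exactly what the forward and backward recursions avoid.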
Forward Probability
The joint probability of the partial observation sequence $x_1,\cdots,x_t$ and the $t$-th hidden state being $s_i$ is called the forward probability:
$$\alpha_t(i) = P(x_1,\cdots,x_t, z_t = s_i;\theta)$$
It can be computed recursively by dynamic programming:
- Initialization ($t = 1$):
$$\alpha_1(i) = \pi_i b_i(x_1),\ 1 \leq i \leq N$$
- Recursion ($t = 2,\cdots,T$):
$$\alpha_t(j) = \left( \sum_{i=1}^N \alpha_{t-1}(i) a_{ij} \right) b_j(x_t),\ 1 \leq j \leq N$$
- Termination:
$$P(x;\theta) = \sum_{i=1}^N \alpha_T(i)$$
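The three steps above can be sketched directly; the toy weather/dampness model below is an illustrative assumption.

```python
# Sketch of the forward recursion on an assumed toy weather model.
S = ["sunny", "rainy"]
pi = {"sunny": 0.6, "rainy": 0.4}
A = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}
B = {"sunny": {"dry": 0.9, "damp": 0.1},
     "rainy": {"dry": 0.2, "damp": 0.8}}

def forward(x):
    """Return the list of dicts alpha_t(i) for t = 1..T."""
    alpha = [{i: pi[i] * B[i][x[0]] for i in S}]          # initialization
    for t in range(1, len(x)):                            # recursion
        alpha.append({j: sum(alpha[-1][i] * A[i][j] for i in S) * B[j][x[t]]
                      for j in S})
    return alpha

def likelihood(x):
    return sum(forward(x)[-1].values())                   # termination

print(likelihood(["dry", "damp"]))  # 0.1712
```

Each step reuses the previous $\alpha_{t-1}$, so the cost is $O(N^2T)$ rather than the $O(N^T)$ of brute-force enumeration over hidden sequences.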
Backward Probability
The conditional probability of generating the partial observation sequence $x_{t+1},\cdots,x_T$ given that the $t$-th hidden state is $s_i$ is called the backward probability, defined as:
$$\beta_t(i) = P(x_{t+1},\cdots,x_T \mid z_t = s_i;\theta)$$
It is computed recursively by dynamic programming as follows:
- Initialization ($t = T$):
$$\beta_T(i) = 1,\ 1 \leq i \leq N$$
- Recursion ($t = T-1,\cdots,1$):
$$\beta_t(i) = \sum_{j=1}^N a_{ij} b_j(x_{t+1}) \beta_{t+1}(j),\ 1 \leq i \leq N$$
- Termination:
$$P(x;\theta) = \sum_{i=1}^N \pi_i b_i(x_1) \beta_1(i)$$
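A matching sketch of the backward recursion on the same assumed toy model; its termination value agrees with the forward pass, which makes a handy correctness check.

```python
# Sketch of the backward recursion on an assumed toy weather model.
S = ["sunny", "rainy"]
pi = {"sunny": 0.6, "rainy": 0.4}
A = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}
B = {"sunny": {"dry": 0.9, "damp": 0.1},
     "rainy": {"dry": 0.2, "damp": 0.8}}

def backward(x):
    """Return the list of dicts beta_t(i) for t = 1..T, filled right to left."""
    beta = [{i: 1.0 for i in S}]                          # beta_T(i) = 1
    for t in range(len(x) - 1, 0, -1):                    # t = T-1, ..., 1
        beta.insert(0, {i: sum(A[i][j] * B[j][x[t]] * beta[0][j] for j in S)
                        for i in S})
    return beta

def likelihood_backward(x):
    b1 = backward(x)[0]
    return sum(pi[i] * B[i][x[0]] * b1[i] for i in S)     # termination

print(likelihood_backward(["dry", "damp"]))  # 0.1712, same as the forward pass
```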
Computing the Optimal Hidden State Sequence: the Viterbi Algorithm
Goal: given an observation sequence $x = x_1,\cdots,x_t,\cdots,x_T$ and model parameters $\theta$, find the optimal hidden state sequence:
$$\begin{aligned} \hat{z} &= \arg\max_z \{P(z \mid x;\theta)\} \\ &= \arg\max_z \left\{ \frac{P(x,z;\theta)}{P(x;\theta)} \right\} \\ &= \arg\max_z \{P(x,z;\theta)\} \\ &= \arg\max_z \left\{ p(z_1)\, p(x_1 \mid z_1) \prod_{t=2}^T p(z_t \mid z_{t-1})\, p(x_t \mid z_t) \right\} \end{aligned}$$
Viewed as a best-path problem on a graph, let $\delta_i = \max\limits_{j \in heads(i)} \{\omega_{ji}\delta_j\}$ be the score of the best path from node 1 to node $i$, and $\psi_i = \arg\max\limits_{j \in heads(i)} \{\omega_{ji}\delta_j\}$ the predecessor on that path.
The Viterbi algorithm:
- Initialization:
$$\delta_1(i) = \pi_i b_i(x_1), \quad \psi_1(i) = 0$$
- Recursion ($t = 2,\cdots,T$):
$$\delta_t(j) = \max_{1 \leq i \leq N} \{\delta_{t-1}(i) a_{ij}\}\, b_j(x_t), \quad \psi_t(j) = \arg\max_{1 \leq i \leq N} \{\delta_{t-1}(i) a_{ij}\}$$
- Termination:
$$\hat{P} = \max_{1 \leq i \leq N} \{\delta_T(i)\}, \quad \hat{z}_T = \arg\max_{1 \leq i \leq N} \{\delta_T(i)\}$$
- Backtracking ($t = T-1,\cdots,1$):
$$\hat{z}_t = \psi_{t+1}(\hat{z}_{t+1})$$
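The four steps can be sketched as follows, again on an assumed toy weather model; `viterbi` returns the most probable hidden sequence for the observations.

```python
# Sketch of the Viterbi algorithm on an assumed toy weather model.
S = ["sunny", "rainy"]
pi = {"sunny": 0.6, "rainy": 0.4}
A = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}
B = {"sunny": {"dry": 0.9, "damp": 0.1},
     "rainy": {"dry": 0.2, "damp": 0.8}}

def viterbi(x):
    """Most probable hidden sequence via max-product DP with backpointers."""
    delta = [{i: pi[i] * B[i][x[0]] for i in S}]          # delta_1(i)
    psi = [{i: None for i in S}]                          # psi_1(i) unused
    for t in range(1, len(x)):
        delta.append({})
        psi.append({})
        for j in S:
            best = max(S, key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best                              # argmax predecessor
            delta[t][j] = delta[t - 1][best] * A[best][j] * B[j][x[t]]
    z = [max(S, key=lambda i: delta[-1][i])]              # \hat z_T
    for t in range(len(x) - 1, 0, -1):                    # backtracking
        z.insert(0, psi[t][z[0]])
    return z

print(viterbi(["dry", "damp", "damp"]))  # ['sunny', 'rainy', 'rainy']
```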
Learning the Model: the Forward-Backward Algorithm
Goal: estimate the model parameters. Only the observation sequences are known and the hidden state sequences are unobserved, so the main challenge is that the likelihood sums over an exponential number of hidden state sequences.
Given a training set $\mathscr{D} = \{x^{(d)}\}_{d=1}^D$, use maximum likelihood estimation to obtain the optimal model parameters:
$$\hat{\theta} = \arg\max_{\theta}\{L(\theta)\}$$
The Expectation-Maximization (EM) algorithm is widely used to estimate the parameters of latent-variable models. Let $\mathbf{X}$ denote the observed data and $\mathbf{Z}$ the unobserved data, i.e., the hidden state sequences. EM iterates over the following two steps:
- E-step: compute the expected value of the log-likelihood
$$\mathbf{Q}(\theta \mid \theta^{old}) = \mathbb{E}_{\mathbf{Z} \mid \mathbf{X};\theta^{old}}\left[\log P(\mathbf{X},\mathbf{Z};\theta)\right]$$
- M-step: find the parameters that maximize this expectation
$$\theta^{new} = \arg\max_{\theta}\left\{\mathbf{Q}(\theta \mid \theta^{old})\right\}$$
When EM is used to train a hidden Markov model, the objective actually optimized is the expected complete-data log-likelihood with the normalization constraints attached via Lagrange multipliers:
$$\begin{aligned} J(\theta,\lambda,\gamma,\phi) &= \sum_{d=1}^D \mathbb{E}_{\mathbf{Z} \mid \mathbf{x}^{(d)};\theta^{old}}\left[\log P(\mathbf{x}^{(d)},\mathbf{Z};\theta)\right] \\ &\quad - \lambda\left(\sum_{z \in \mathcal{S}} p(z) - 1\right) - \sum_{z \in \mathcal{S}} \gamma_z \left(\sum_{z' \in \mathcal{S}} p(z' \mid z) - 1\right) - \sum_{z \in \mathcal{S}} \phi_z \left(\sum_{x \in \mathscr{O}} p(x \mid z) - 1\right) \end{aligned}$$
Taking the partial derivatives and setting them to zero yields:
$$p(z) = \frac{c(z,\mathscr{D})}{\sum_{z' \in \mathcal{S}} c(z',\mathscr{D})}, \qquad p(z' \mid z) = \frac{c(z,z',\mathscr{D})}{\sum_{z'' \in \mathcal{S}} c(z,z'',\mathscr{D})}, \qquad p(x \mid z) = \frac{c(z,x,\mathscr{D})}{\sum_{x' \in \mathscr{O}} c(z,x',\mathscr{D})}$$
where $c(\cdot)$ is an expected count: $c(z,\mathscr{D})$ is the expected number of times the first hidden state is $z$ over the training set $\mathscr{D}$, $c(z,z',\mathscr{D})$ is the expected number of times hidden state $z'$ immediately follows hidden state $z$, and $c(z,x,\mathscr{D})$ is the expected number of times hidden state $z$ generates observed state $x$.
These expected counts are defined as:
$$\begin{aligned} c(z,\mathscr{D}) &\equiv \sum_{d=1}^D \mathbb{E}_{\mathbf{Z} \mid \mathbf{x}^{(d)};\theta^{old}}\left[\delta(z_1,z)\right] \\ c(z,z',\mathscr{D}) &\equiv \sum_{d=1}^D \mathbb{E}_{\mathbf{Z} \mid \mathbf{x}^{(d)};\theta^{old}}\left[\sum_{t=2}^{T^{(d)}} \delta(z_{t-1},z)\,\delta(z_t,z')\right] \\ c(z,x,\mathscr{D}) &\equiv \sum_{d=1}^D \mathbb{E}_{\mathbf{Z} \mid \mathbf{x}^{(d)};\theta^{old}}\left[\sum_{t=1}^{T^{(d)}} \delta(z_t,z)\,\delta(x_t^{(d)},x)\right] \end{aligned}$$
The expectations are taken under the posterior over hidden state sequences, $P(\mathbf{z} \mid \mathbf{x}^{(d)};\theta^{old})$, and computed naively they involve an exponential number of terms. Taking the expected number of hidden state transitions as an example, the sum collapses to a single marginal:
$$\begin{aligned} \mathbb{E}_{\mathbf{Z} \mid \mathbf{x}^{(d)};\theta^{old}}\left[\delta(z_{t-1},z)\,\delta(z_t,z')\right] &= \sum_{\mathbf{z}} P(\mathbf{z} \mid \mathbf{x}^{(d)};\theta^{old})\,\delta(z_{t-1},z)\,\delta(z_t,z') \\ &= \sum_{\mathbf{z}} \frac{P(\mathbf{x}^{(d)},\mathbf{z};\theta^{old})}{P(\mathbf{x}^{(d)};\theta^{old})}\,\delta(z_{t-1},z)\,\delta(z_t,z') \\ &= \frac{1}{P(\mathbf{x}^{(d)};\theta^{old})} \sum_{\mathbf{z}} P(\mathbf{x}^{(d)},\mathbf{z};\theta^{old})\,\delta(z_{t-1},z)\,\delta(z_t,z') \\ &= \frac{P(\mathbf{x}^{(d)}, z_{t-1} = z, z_t = z';\theta^{old})}{P(\mathbf{x}^{(d)};\theta^{old})} \end{aligned}$$
The denominator $P(\mathbf{x}^{(d)};\theta^{old})$ is given by the forward algorithm, and the numerator factorizes into forward and backward probabilities:
$$\begin{aligned} P(\mathbf{x}, z_{t-1} = s_i, z_t = s_j;\theta) &= P(x_1,\cdots,x_{t-1}, z_{t-1} = s_i;\theta) \times P(z_t = s_j \mid z_{t-1} = s_i;\theta) \\ &\quad \times P(x_t \mid z_t = s_j;\theta) \times P(x_{t+1},\cdots,x_T \mid z_t = s_j;\theta) \\ &= \alpha_{t-1}(i)\, a_{ij}\, b_j(x_t)\, \beta_t(j) \end{aligned}$$
Estimating the initial hidden state probabilities
$$\begin{aligned} p(z) &= \frac{c(z,\mathscr{D})}{\sum_{z' \in \mathcal{S}} c(z',\mathscr{D})} = \frac{\sum_{d=1}^D P(\mathbf{x}^{(d)}, z_1 = z;\theta^{old})}{\sum_{d=1}^D P(\mathbf{x}^{(d)};\theta^{old})} \\ \overline{\pi}_i &= \frac{\sum_{d=1}^D \alpha_1(i)\,\beta_1(i)}{\sum_{d=1}^D \sum_{j=1}^N \alpha_{T^{(d)}}(j)} \end{aligned}$$
(with $\alpha$ and $\beta$ computed on the $d$-th sequence).
Estimating the hidden state transition probabilities
$$\begin{aligned} p(z' \mid z) &= \frac{c(z,z',\mathscr{D})}{\sum_{z'' \in \mathcal{S}} c(z,z'',\mathscr{D})} = \frac{\sum_{d=1}^D \sum_{t=2}^{T^{(d)}} P(\mathbf{x}^{(d)}, z_{t-1} = z, z_t = z';\theta^{old})}{\sum_{d=1}^D \sum_{t=2}^{T^{(d)}} P(\mathbf{x}^{(d)}, z_{t-1} = z;\theta^{old})} \\ \overline{a}_{ij} &= \frac{\sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \alpha_{t-1}(i)\, a_{ij}\, b_j(x_t^{(d)})\, \beta_t(j)}{\sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \alpha_{t-1}(i)\, \beta_{t-1}(i)} \end{aligned}$$
(the denominator uses $P(\mathbf{x}^{(d)}, z_{t-1} = s_i;\theta^{old}) = \alpha_{t-1}(i)\beta_{t-1}(i)$).
Estimating the observation generation probabilities
$$\begin{aligned} p(x \mid z) &= \frac{c(z,x,\mathscr{D})}{\sum_{x' \in \mathscr{O}} c(z,x',\mathscr{D})} = \frac{\sum_{d=1}^D \sum_{t=1}^{T^{(d)}} \delta(x_t^{(d)},x)\, P(\mathbf{x}^{(d)}, z_t = z;\theta^{old})}{\sum_{d=1}^D \sum_{t=1}^{T^{(d)}} P(\mathbf{x}^{(d)}, z_t = z;\theta^{old})} \\ \overline{b}_i(k) &= \frac{\sum_{d=1}^D \sum_{t=1}^{T^{(d)}} \delta(x_t^{(d)}, o_k)\, \alpha_t(i)\, \beta_t(i)}{\sum_{d=1}^D \sum_{t=1}^{T^{(d)}} \alpha_t(i)\, \beta_t(i)} \end{aligned}$$
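Putting the pieces together, one Baum-Welch re-estimation step accumulates the expected counts from the forward and backward probabilities and then normalizes them; the toy model and training observations below are illustrative assumptions.

```python
# One Baum-Welch (EM) re-estimation step built from forward/backward probabilities.
# The toy weather model and the observation data are illustrative assumptions.
S = ["sunny", "rainy"]                                   # hidden states
O = ["dry", "damp"]                                      # observation symbols
pi = {"sunny": 0.6, "rainy": 0.4}
A = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}
B = {"sunny": {"dry": 0.9, "damp": 0.1},
     "rainy": {"dry": 0.2, "damp": 0.8}}

def forward(x):
    alpha = [{i: pi[i] * B[i][x[0]] for i in S}]
    for t in range(1, len(x)):
        alpha.append({j: sum(alpha[-1][i] * A[i][j] for i in S) * B[j][x[t]]
                      for j in S})
    return alpha

def backward(x):
    beta = [{i: 1.0 for i in S}]
    for t in range(len(x) - 1, 0, -1):
        beta.insert(0, {i: sum(A[i][j] * B[j][x[t]] * beta[0][j] for j in S)
                        for i in S})
    return beta

def baum_welch_step(data):
    """E-step: expected counts c(z,D), c(z,z',D), c(z,x,D); M-step: normalize."""
    c_init = {i: 0.0 for i in S}
    c_trans = {i: {j: 0.0 for j in S} for i in S}
    c_emit = {i: {k: 0.0 for k in O} for i in S}
    for x in data:
        al, be = forward(x), backward(x)
        px = sum(al[-1].values())                        # P(x; theta_old)
        for i in S:
            c_init[i] += al[0][i] * be[0][i] / px        # expected initial counts
            for t in range(1, len(x)):                   # expected transitions
                for j in S:
                    c_trans[i][j] += al[t-1][i] * A[i][j] * B[j][x[t]] * be[t][j] / px
            for t in range(len(x)):                      # expected emissions
                c_emit[i][x[t]] += al[t][i] * be[t][i] / px
    new_pi = {i: c_init[i] / sum(c_init.values()) for i in S}
    new_A = {i: {j: c_trans[i][j] / sum(c_trans[i].values()) for j in S} for i in S}
    new_B = {i: {k: c_emit[i][k] / sum(c_emit[i].values()) for k in O} for i in S}
    return new_pi, new_A, new_B

new_pi, new_A, new_B = baum_welch_step([["dry", "damp", "damp"], ["dry", "dry"]])
print(new_pi)
```

Iterating `baum_welch_step` until the likelihood stops improving gives the usual EM training loop; each update leaves every probability table properly normalized.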