NLP Study Notes - Lecture 3: Hidden Markov Models

The lecture first introduces the Markov model and how its parameters are learned, then moves to the hidden Markov model, covering how to compute the probability of an observation sequence and how to find the most likely hidden state sequence given the observations, and finally presents parameter learning for hidden Markov models.

Markov Models

State set: $\mathcal{S} = \{s_1,\cdots,s_N\}$
Observed state sequence: $x = x_1,\cdots,x_t,\cdots,x_T$, where $x_t \in \mathcal{S}$
Initial state probability: $\pi_i = p(x_1 = s_i),\ 1 \leq i \leq N$
State transition probability: $a_{ij} = p(x_t = s_j|x_{t-1} = s_i),\ 1 \leq i,j \leq N$
Probability of an observed state sequence (assuming the current state $x_t$ depends only on the previous state $x_{t-1}$; this is a first-order Markov model):
$$P(x;\theta) = \prod_{t=1}^T p(x_t|x_1,\cdots,x_{t-1}) \approx p(x_1) \times \prod_{t=2}^T p(x_t|x_{t-1})$$
where $\theta = \{p(x)|x \in \mathcal{S}\} \cup \{p(x'|x)|x,x' \in \mathcal{S}\}$
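The first-order factorization above can be sketched in a few lines of code; the two-state weather chain and its probabilities below are made-up numbers for illustration only:

```python
import numpy as np

# Hypothetical 2-state weather chain (states: 0 = rainy, 1 = sunny).
pi = np.array([0.6, 0.4])            # initial probabilities p(x1 = s_i)
A = np.array([[0.7, 0.3],            # transitions a_ij = p(x_t = s_j | x_{t-1} = s_i)
              [0.4, 0.6]])

def sequence_prob(x, pi, A):
    """P(x) = p(x1) * prod_t p(x_t | x_{t-1}) under the first-order Markov assumption."""
    p = pi[x[0]]
    for prev, cur in zip(x, x[1:]):
        p *= A[prev, cur]
    return p

print(sequence_prob([0, 0, 1], pi, A))  # 0.6 * 0.7 * 0.3 = 0.126
```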

Model Learning

Goal: learn the model parameters $\theta$, i.e., the initial state probabilities and the state transition probabilities, via maximum likelihood estimation.
Assume the training set contains $D$ samples, $\mathscr{D} = \{x^{(d)}\}_{d=1}^D$. Maximum likelihood estimation obtains the optimal parameters automatically from the training data:
$$\hat{\theta} = \arg\max_{\theta}\{L(\theta)\}$$
The log-likelihood function:
$$L(\theta) = \sum_{d=1}^D \log P(x^{(d)};\theta) = \sum_{d=1}^D \left( \log p(x_1^{(d)}) + \sum_{t=2}^{T^{(d)}} \log p(x_t^{(d)}|x_{t-1}^{(d)}) \right)$$
where $T^{(d)}$ is the length of the $d$-th training sequence. The parameters must also satisfy the following two constraints:
$$\sum_{x \in \mathcal{S}} p(x) = 1, \qquad \forall x: \sum_{x' \in \mathcal{S}} p(x'|x) = 1$$
Lagrange multipliers are introduced to solve this constrained optimization problem:
$$J(\theta,\lambda,\gamma) = L(\theta) - \lambda\left(\sum_{x \in \mathcal{S}} p(x) - 1\right) - \sum_{x \in \mathcal{S}} \gamma_x \left(\sum_{x' \in \mathcal{S}} p(x'|x) - 1\right)$$
where $\lambda$ is the multiplier for the initial-probability constraint and $\gamma = \{\gamma_x|x \in \mathcal{S}\}$ is the set of multipliers for the transition-probability constraints.

  • First, take the partial derivative with respect to the initial state probability $p(x)$; all terms of $J$ that do not involve $p(x)$ vanish:
    $$\frac{\partial J(\theta,\lambda,\gamma)}{\partial p(x)} = \frac{\partial}{\partial p(x)} \sum_{d=1}^D \log p(x_1^{(d)}) - \lambda = \frac{1}{p(x)} \sum_{d=1}^D \delta(x_1^{(d)},x) - \lambda$$
    where $\delta(a,b)$ equals 1 when $a = b$ and 0 otherwise.
    Let $c(x,\mathscr{D})$ denote the number of times $x$ appears as the first state in the training data:
    $$c(x,\mathscr{D}) = \sum_{d=1}^D \delta(x_1^{(d)},x)$$
    Next, take the partial derivative with respect to the Lagrange multiplier $\lambda$:
    $$\frac{\partial J(\theta,\lambda,\gamma)}{\partial \lambda} = \sum_{x \in \mathcal{S}} p(x) - 1$$
    Setting both derivatives to zero gives $p(x) = \frac{c(x,\mathscr{D})}{\lambda}$ and $\lambda = \sum_{x \in \mathcal{S}} c(x,\mathscr{D})$, so the initial state probability is estimated as:
    $$p(x) = \frac{c(x,\mathscr{D})}{\sum_{x' \in \mathcal{S}} c(x',\mathscr{D})}$$
  • Next, take the partial derivative with respect to the transition probability $p(x'|x)$; again, the unrelated terms vanish:
    $$\frac{\partial J(\theta,\lambda,\gamma)}{\partial p(x'|x)} = \frac{\partial}{\partial p(x'|x)} \sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \log p(x_t^{(d)}|x_{t-1}^{(d)}) - \gamma_x = \frac{1}{p(x'|x)} \sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \delta(x_{t-1}^{(d)},x)\,\delta(x_t^{(d)},x') - \gamma_x$$
    Let $c(x,x',\mathscr{D})$ denote the number of times $x'$ immediately follows $x$ in the training data:
    $$c(x,x',\mathscr{D}) = \sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \delta(x_{t-1}^{(d)},x)\,\delta(x_t^{(d)},x')$$
    Then take the partial derivative with respect to the Lagrange multiplier $\gamma_x$:
    $$\frac{\partial J(\theta,\lambda,\gamma)}{\partial \gamma_x} = \sum_{x' \in \mathcal{S}} p(x'|x) - 1$$
    Setting both derivatives to zero gives $p(x'|x) = \frac{c(x,x',\mathscr{D})}{\gamma_x}$ and $\gamma_x = \sum_{x'' \in \mathcal{S}} c(x,x'',\mathscr{D})$, hence:
    $$p(x'|x) = \frac{c(x,x',\mathscr{D})}{\sum_{x'' \in \mathcal{S}} c(x,x'',\mathscr{D})}$$
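The closed-form estimates above are just normalized counts, which can be sketched directly; the tiny training set below is invented for illustration:

```python
from collections import Counter

# Hypothetical training set: each sequence is a list of state labels.
data = [["rain", "rain", "sun"], ["sun", "sun"], ["rain", "sun", "sun"]]

init_counts = Counter(seq[0] for seq in data)                                 # c(x, D)
trans_counts = Counter((a, b) for seq in data for a, b in zip(seq, seq[1:]))  # c(x, x', D)

# p(x) = c(x, D) / sum_x' c(x', D)
total = sum(init_counts.values())
p_init = {s: c / total for s, c in init_counts.items()}

# p(x'|x) = c(x, x', D) / sum_x'' c(x, x'', D)
out_totals = Counter()
for (a, _), c in trans_counts.items():
    out_totals[a] += c
p_trans = {(a, b): c / out_totals[a] for (a, b), c in trans_counts.items()}

print(p_init["rain"])            # 2 of 3 sequences start with "rain"
print(p_trans[("rain", "sun")])  # 2 of 3 transitions out of "rain" go to "sun"
```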

Computing the Observation Probability

Goal: compute the probability of an observation sequence.
An example: locked in a dark room, you cannot see the weather outside, but you can infer it by feeling how damp the floor is. Here the dampness of the floor is the observed state and the weather outside is the hidden state.
Observation set: $\mathscr{O} = \{o_1,\cdots,o_M\}$
Hidden state set: $\mathcal{S} = \{s_1,\cdots,s_N\}$
Observed state sequence: $x = x_1,\cdots,x_t,\cdots,x_T$
Hidden state sequence: $z = z_1,\cdots,z_t,\cdots,z_T$
Initial hidden state probability: $\pi_i = p(z_1 = s_i),\ 1 \leq i \leq N$
Hidden state transition probability: $a_{ij} = p(z_t = s_j|z_{t-1} = s_i),\ 1 \leq i,j \leq N$
Emission (observation generation) probability: $b_j(k) = p(x_t = o_k|z_t = s_j),\ 1 \leq j \leq N \wedge 1 \leq k \leq M$

The hidden Markov model:

$$P(x;\theta) = \sum_z P(x,z;\theta) = \sum_z p(z_1)\,p(x_1|z_1) \prod_{t=2}^T p(z_t|z_{t-1})\,p(x_t|z_t)$$
Model parameters: $\theta = \{p(z)|z \in \mathcal{S}\} \cup \{p(z'|z)|z,z' \in \mathcal{S}\} \cup \{p(x|z)|x \in \mathscr{O} \wedge z \in \mathcal{S}\}$

Forward Probability

The joint probability of the partial observation sequence $x_1,\cdots,x_t$ and the $t$-th hidden state being $s_i$ is called the forward probability:
$$\alpha_t(i) = P(x_1,\cdots,x_t,z_t = s_i;\theta)$$
It is computed recursively by dynamic programming:

  • Initialization: $t = 1$
    $$\alpha_1(i) = \pi_i\,b_i(x_1),\quad 1 \leq i \leq N$$
  • Recursion: $t = 2,\cdots,T$
    $$\alpha_t(j) = \left(\sum_{i=1}^N \alpha_{t-1}(i)\,a_{ij}\right) b_j(x_t),\quad 1 \leq j \leq N$$
  • Termination:
    $$P(x;\theta) = \sum_{i=1}^N \alpha_T(i)$$
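The three steps above can be sketched as follows, assuming a hypothetical two-state weather HMM (hidden: 0 = rain, 1 = sun; observed: 0 = wet, 1 = dry) with invented parameters:

```python
import numpy as np

def forward(x, pi, A, B):
    """alpha[t, i] = P(x_1..x_t, z_t = s_i); returns (alpha, P(x))."""
    T, N = len(x), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, x[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]  # recursion
    return alpha, alpha[-1].sum()                   # termination: sum_i alpha_T(i)

# Hypothetical weather HMM.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
alpha, px = forward([0, 1, 0], pi, A, B)
print(px)
```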

Backward Probability

The conditional probability of generating the partial observation sequence $x_{t+1},\cdots,x_T$ given that the $t$-th hidden state is $s_i$ is called the backward probability, defined as:
$$\beta_t(i) = P(x_{t+1},\cdots,x_T|z_t = s_i;\theta)$$
It is computed recursively by dynamic programming as follows:

  • Initialization: $t = T$
    $$\beta_T(i) = 1,\quad 1 \leq i \leq N$$
  • Recursion: $t = T-1,\cdots,1$
    $$\beta_t(i) = \sum_{j=1}^N a_{ij}\,b_j(x_{t+1})\,\beta_{t+1}(j),\quad 1 \leq i \leq N$$
  • Termination:
    $$P(x;\theta) = \sum_{i=1}^N \pi_i\,b_i(x_1)\,\beta_1(i)$$
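The backward recursion can be sketched the same way; run on the same hypothetical two-state weather model, its termination formula yields the same $P(x;\theta)$ as the forward pass:

```python
import numpy as np

def backward(x, pi, A, B):
    """beta[t, i] = P(x_{t+1}..x_T | z_t = s_i); returns (beta, P(x))."""
    T, N = len(x), len(pi)
    beta = np.ones((T, N))                            # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])  # recursion
    return beta, (pi * B[:, x[0]] * beta[0]).sum()    # termination

# Same hypothetical weather HMM as in the forward example.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
beta, px = backward([0, 1, 0], pi, A, B)
print(px)  # matches the forward-algorithm result
```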

Computing the Optimal Hidden State Sequence: the Viterbi Algorithm

Goal: given an observation sequence $x = x_1,\cdots,x_t,\cdots,x_T$ and model parameters $\theta$, find the optimal hidden state sequence:
$$\hat{z} = \arg\max_z\{P(z|x;\theta)\} = \arg\max_z\left\{\frac{P(x,z;\theta)}{P(x;\theta)}\right\} = \arg\max_z\{P(x,z;\theta)\} = \arg\max_z\left\{p(z_1)\,p(x_1|z_1) \prod_{t=2}^T p(z_t|z_{t-1})\,p(x_t|z_t)\right\}$$
Viewed as a best-path problem on a weighted graph, let $\delta_i = \max_{j \in heads(i)}\{\omega_{ji}\delta_j\}$ be the score of the best path from node 1 to node $i$, and let $\psi_i = \arg\max_{j \in heads(i)}\{\omega_{ji}\delta_j\}$ record which predecessor achieves it.

Viterbi algorithm

  • Initialization:
    $$\delta_1(i) = \pi_i\,b_i(x_1),\qquad \psi_1(i) = 0$$
  • Recursion: $t = 2,\cdots,T$
    $$\delta_t(j) = \max_{1 \leq i \leq N}\{\delta_{t-1}(i)\,a_{ij}\}\,b_j(x_t),\qquad \psi_t(j) = \arg\max_{1 \leq i \leq N}\{\delta_{t-1}(i)\,a_{ij}\}$$
  • Termination:
    $$\hat{P} = \max_{1 \leq i \leq N}\{\delta_T(i)\},\qquad \hat{z}_T = \arg\max_{1 \leq i \leq N}\{\delta_T(i)\}$$
  • Backtracking: $t = T-1,\cdots,1$
    $$\hat{z}_t = \psi_{t+1}(\hat{z}_{t+1})$$
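The four steps can be sketched compactly, again on the hypothetical two-state weather model (hidden: 0 = rain, 1 = sun; observed: 0 = wet, 1 = dry):

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most likely hidden state sequence for observations x (max-product DP)."""
    T, N = len(x), len(pi)
    delta = np.zeros((T, N))           # delta[t, j]: best path score ending in state j
    psi = np.zeros((T, N), dtype=int)  # psi[t, j]: backpointer to best predecessor
    delta[0] = pi * B[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, x[t]]
    z = [int(delta[-1].argmax())]                 # termination
    for t in range(T - 1, 0, -1):                 # backtracking
        z.append(int(psi[t, z[-1]]))
    return z[::-1], delta[-1].max()

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
path, score = viterbi([0, 1, 0], pi, A, B)
print(path)  # best hidden sequence for wet, dry, wet
```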

Model Learning: the Forward-Backward Algorithm

Goal: estimate the model parameters. Only the observation sequences are known; the hidden state sequences are unobserved, so the main challenge is that learning requires summing over an exponential number of hidden state sequences.
Given a training set $\mathscr{D} = \{x^{(d)}\}_{d=1}^D$, maximum likelihood estimation gives the optimal parameters:
$$\hat{\theta} = \arg\max_{\theta}\{L(\theta)\}$$
The Expectation-Maximization (EM) algorithm is widely used to estimate the parameters of latent-variable models. Let $\mathbf{X}$ denote the observed data and $\mathbf{Z}$ the unobserved data, i.e., the hidden state sequences. EM iterates over two steps:

  • E-step: compute the expected log-likelihood
    $$\mathbf{Q}(\theta|\theta^{old}) = \mathbb{E}_{\mathbf{Z}|\mathbf{X};\theta^{old}}\left[\log P(\mathbf{X},\mathbf{Z};\theta)\right]$$
  • M-step: find the parameters that maximize this expectation
    $$\theta^{new} = \arg\max_{\theta}\left\{\mathbf{Q}(\theta|\theta^{old})\right\}$$
    When EM is used to train a hidden Markov model, the objective actually used in the E-step (with the probability constraints attached via Lagrange multipliers) is:
    $$J(\theta,\lambda,\gamma,\phi) = \sum_{d=1}^D \mathbb{E}_{\mathbf{Z}|\mathbf{x}^{(d)};\theta^{old}}\left[\log P(\mathbf{x}^{(d)},\mathbf{Z};\theta)\right] - \lambda\left(\sum_{z \in \mathcal{S}} p(z) - 1\right) - \sum_{z \in \mathcal{S}} \gamma_z\left(\sum_{z' \in \mathcal{S}} p(z'|z) - 1\right) - \sum_{z \in \mathcal{S}} \phi_z\left(\sum_{x \in \mathscr{O}} p(x|z) - 1\right)$$
    Taking partial derivatives as before yields:
    $$p(z) = \frac{c(z,\mathscr{D})}{\sum_{z' \in \mathcal{S}} c(z',\mathscr{D})},\qquad p(z'|z) = \frac{c(z,z',\mathscr{D})}{\sum_{z'' \in \mathcal{S}} c(z,z'',\mathscr{D})},\qquad p(x|z) = \frac{c(z,x,\mathscr{D})}{\sum_{x' \in \mathscr{O}} c(z,x',\mathscr{D})}$$
    where $c(\cdot)$ is a count function: $c(z,\mathscr{D})$ is the expected number of times the first hidden state is $z$ on the training set $\mathscr{D}$, $c(z,z',\mathscr{D})$ is the expected number of times hidden state $z'$ immediately follows hidden state $z$, and $c(z,x,\mathscr{D})$ is the expected number of times hidden state $z$ emits observation $x$.
    These expected counts are defined as:
    $$c(z,\mathscr{D}) \equiv \sum_{d=1}^D \mathbb{E}_{\mathbf{Z}|\mathbf{x}^{(d)};\theta^{old}}\left[\delta(z_1,z)\right],\qquad c(z,z',\mathscr{D}) \equiv \sum_{d=1}^D \mathbb{E}_{\mathbf{Z}|\mathbf{x}^{(d)};\theta^{old}}\left[\sum_{t=2}^{T^{(d)}} \delta(z_{t-1},z)\,\delta(z_t,z')\right],\qquad c(z,x,\mathscr{D}) \equiv \sum_{d=1}^D \mathbb{E}_{\mathbf{Z}|\mathbf{x}^{(d)};\theta^{old}}\left[\sum_{t=1}^{T^{(d)}} \delta(z_t,z)\,\delta(x_t^{(d)},x)\right]$$
    The expectations are taken under the posterior $P(\mathbf{z}|\mathbf{x}^{(d)};\theta^{old})$, and computed naively they involve an exponential number of terms. Take the expected number of hidden state transitions as an example:
    $$\mathbb{E}_{\mathbf{Z}|\mathbf{x}^{(d)};\theta^{old}}\left[\delta(z_{t-1},z)\,\delta(z_t,z')\right] = \sum_{\mathbf{z}} P(\mathbf{z}|\mathbf{x}^{(d)};\theta^{old})\,\delta(z_{t-1},z)\,\delta(z_t,z') = \frac{1}{P(\mathbf{x}^{(d)};\theta^{old})} \sum_{\mathbf{z}} P(\mathbf{x}^{(d)},\mathbf{z};\theta^{old})\,\delta(z_{t-1},z)\,\delta(z_t,z') = \frac{P(\mathbf{x}^{(d)}, z_{t-1}=z, z_t=z'; \theta^{old})}{P(\mathbf{x}^{(d)};\theta^{old})}$$
    The denominator $P(\mathbf{x}^{(d)};\theta^{old})$ is obtained from the forward algorithm, and the numerator factorizes in terms of the forward and backward probabilities:
    $$P(\mathbf{x}, z_{t-1}=s_i, z_t=s_j;\theta) = P(x_1,\cdots,x_{t-1}, z_{t-1}=s_i;\theta) \times P(z_t=s_j|z_{t-1}=s_i;\theta) \times P(x_t|z_t=s_j;\theta) \times P(x_{t+1},\cdots,x_T|z_t=s_j;\theta) = \alpha_{t-1}(i)\,a_{ij}\,b_j(x_t)\,\beta_t(j)$$
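This factorization can be checked numerically: the posterior transition probabilities $\xi_t(i,j) = \alpha_{t-1}(i)\,a_{ij}\,b_j(x_t)\,\beta_t(j)/P(x)$ computed from $\alpha$ and $\beta$ must sum to 1 over all $(i,j)$ pairs for each $t$. A sketch with the hypothetical two-state weather HMM used earlier:

```python
import numpy as np

# Hypothetical weather HMM; compute the posterior transition probability
# P(z_{t-1} = s_i, z_t = s_j | x) from the forward and backward tables.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
x = [0, 1, 0]
T, N = len(x), len(pi)

alpha = np.zeros((T, N))
alpha[0] = pi * B[:, x[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
beta = np.ones((T, N))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
px = alpha[-1].sum()

# xi[t, i, j] = alpha_{t-1}(i) a_ij b_j(x_t) beta_t(j) / P(x), for t = 1..T-1
xi = np.zeros((T, N, N))
for t in range(1, T):
    xi[t] = alpha[t - 1][:, None] * A * B[:, x[t]] * beta[t][None, :] / px

print(xi[1].sum())  # each xi[t] is a proper distribution over (i, j)
```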

Estimating the Initial Hidden State Probabilities

$$p(z) = \frac{c(z,\mathscr{D})}{\sum_{z' \in \mathcal{S}} c(z',\mathscr{D})} = \frac{\sum_{d=1}^D P(\mathbf{x}^{(d)}, z_1 = z;\theta^{old})}{\sum_{d=1}^D P(\mathbf{x}^{(d)};\theta^{old})}$$
In terms of the forward and backward probabilities of the $d$-th sequence:
$$\overline{\pi}_i = \frac{\sum_{d=1}^D \alpha_1^{(d)}(i)\,\beta_1^{(d)}(i)}{\sum_{d=1}^D \sum_{j=1}^N \alpha_{T^{(d)}}^{(d)}(j)}$$

Estimating the Hidden State Transition Probabilities

$$p(z'|z) = \frac{c(z,z',\mathscr{D})}{\sum_{z'' \in \mathcal{S}} c(z,z'',\mathscr{D})} = \frac{\sum_{d=1}^D \sum_{t=2}^{T^{(d)}} P(\mathbf{x}^{(d)}, z_{t-1}=z, z_t=z';\theta^{old})}{\sum_{d=1}^D \sum_{t=2}^{T^{(d)}} P(\mathbf{x}^{(d)}, z_{t-1}=z;\theta^{old})}$$
$$\overline{a}_{ij} = \frac{\sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \alpha_{t-1}^{(d)}(i)\,a_{ij}\,b_j(x_t^{(d)})\,\beta_t^{(d)}(j)}{\sum_{d=1}^D \sum_{t=2}^{T^{(d)}} \alpha_{t-1}^{(d)}(i)\,\beta_{t-1}^{(d)}(i)}$$

Estimating the Emission Probabilities

$$p(x|z) = \frac{c(z,x,\mathscr{D})}{\sum_{x' \in \mathscr{O}} c(z,x',\mathscr{D})} = \frac{\sum_{d=1}^D \sum_{t=1}^{T^{(d)}} \delta(x_t^{(d)},x)\,P(\mathbf{x}^{(d)}, z_t=z;\theta^{old})}{\sum_{d=1}^D \sum_{t=1}^{T^{(d)}} P(\mathbf{x}^{(d)}, z_t=z;\theta^{old})}$$
$$\overline{b}_i(k) = \frac{\sum_{d=1}^D \sum_{t=1}^{T^{(d)}} \delta(x_t^{(d)},o_k)\,\alpha_t^{(d)}(i)\,\beta_t^{(d)}(i)}{\sum_{d=1}^D \sum_{t=1}^{T^{(d)}} \alpha_t^{(d)}(i)\,\beta_t^{(d)}(i)}$$
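Putting the three re-estimation formulas together gives one Baum-Welch iteration. The sketch below accumulates the expected counts from $\alpha$ and $\beta$ per sequence and renormalizes; the model numbers are invented, and no numerical scaling is applied, so it is only suitable for short sequences:

```python
import numpy as np

def baum_welch_step(data, pi, A, B):
    """One EM iteration: E-step expected counts via alpha/beta, M-step renormalization."""
    N, M = B.shape
    pi_num = np.zeros(N)
    a_num, a_den = np.zeros((N, N)), np.zeros(N)
    b_num, b_den = np.zeros((N, M)), np.zeros(N)
    for x in data:
        T = len(x)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, x[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        beta = np.ones((T, N))
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        px = alpha[-1].sum()
        gamma = alpha * beta / px                    # P(z_t = s_i | x)
        pi_num += gamma[0]
        for t in range(1, T):                        # expected transition counts
            xi = alpha[t - 1][:, None] * A * B[:, x[t]] * beta[t][None, :] / px
            a_num += xi
            a_den += gamma[t - 1]
        for t in range(T):                           # expected emission counts
            b_num[:, x[t]] += gamma[t]
            b_den += gamma[t]
    return pi_num / len(data), a_num / a_den[:, None], b_num / b_den[:, None]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
new_pi, new_A, new_B = baum_welch_step([[0, 1, 0], [1, 1, 0]], pi, A, B)
print(new_pi)  # rows of new_A and new_B also sum to 1
```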
