[10.1 Algorithm Theory, Part (3): The Learning Problem (the Baum-Welch Algorithm)] Hidden Markov Models: Formula Derivations from Li Hang's *Statistical Learning Methods*

10.3 The Learning Problem (solving Learning: $\lambda_{MLE} = \arg\max_{\lambda} P(O \mid \lambda)$)

10.3.1 The Supervised Learning Method

Suppose the training data consist of S observation sequences of equal length together with their corresponding state sequences, $\{(O_1, I_1), (O_2, I_2), \cdots, (O_S, I_S)\}$. Then the parameters of the hidden Markov model can be estimated by maximum likelihood, as follows (a counting sketch in code is given after the list):

  1. Estimate of the transition probability $a_{ij}$:
     $$a_{ij} = \frac{A_{ij}}{\sum_{j=1}^{N} A_{ij}} \tag{10.30}$$
     where $A_{ij}$ is the number of times in the sample that the chain is in state $q_i$ at time t and moves to state $q_j$ at time t+1;
  2. Estimate of the observation probability $b_j(k)$:
     $$b_j(k) = \frac{B_{jk}}{\sum_{k=1}^{M} B_{jk}} \tag{10.31}$$
     where $B_{jk}$ is the number of times in the sample that the state is $q_j$ and the corresponding observation is $v_k$;
  3. The estimate of the initial state probability $\pi_i$ is the frequency with which the initial state is $q_i$ among the S samples.
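To make the counting concrete, here is a minimal sketch of equations (10.30) and (10.31) in Python with numpy; the two toy labeled sequences and all variable names (`A_counts`, `B_counts`, and so on) are invented for illustration:

```python
import numpy as np

# Toy labeled data: 2 hidden states (0, 1), 2 observation symbols (0, 1).
# Each pair is (observation sequence, state sequence) of equal length.
data = [
    (np.array([0, 1, 1, 0]), np.array([0, 0, 1, 1])),
    (np.array([1, 1, 0, 0]), np.array([1, 1, 0, 0])),
]
N, M = 2, 2  # number of states, number of observation symbols

A_counts = np.zeros((N, N))   # A_counts[i, j]: frequency of transitions q_i -> q_j
B_counts = np.zeros((N, M))   # B_counts[j, k]: frequency of state q_j emitting v_k
pi_counts = np.zeros(N)       # initial-state counts

for obs, states in data:
    pi_counts[states[0]] += 1
    for t in range(len(states) - 1):
        A_counts[states[t], states[t + 1]] += 1
    for t in range(len(states)):
        B_counts[states[t], obs[t]] += 1

A = A_counts / A_counts.sum(axis=1, keepdims=True)   # eq. (10.30)
B = B_counts / B_counts.sum(axis=1, keepdims=True)   # eq. (10.31)
pi = pi_counts / pi_counts.sum()                     # frequency of initial states
print(A, B, pi, sep="\n")
```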

Clearly, the state sequences in such training data usually have to be annotated by hand, which is costly, so unsupervised methods such as the Baum-Welch algorithm are more practical.

10.3.2 The Baum-Welch Algorithm

If only the observation sequence data $O = (o_1, o_2, \cdots, o_T)$ are available, without the state sequence data $S = (s_1, s_2, \cdots, s_T)$, then the hidden Markov model is a probabilistic model with latent variables:
$$P(O \mid \lambda) = \sum_{S} P(O \mid S, \lambda) P(S \mid \lambda) \tag{10.32}$$
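For a tiny model, equation (10.32) can be verified by brute force, enumerating every possible state sequence; in the sketch below the parameters are toy values invented for illustration:

```python
import itertools
import numpy as np

# Toy HMM with N = 2 states and M = 2 observation symbols (parameters invented).
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission matrix
pi = np.array([0.5, 0.5])                # initial distribution
O = [0, 1, 1]                            # observation sequence, T = 3

total = 0.0
for S in itertools.product(range(2), repeat=len(O)):   # all N**T state sequences
    p = pi[S[0]] * B[S[0], O[0]]                       # pi_{s1} * b_{s1}(o1)
    for t in range(1, len(O)):
        p *= A[S[t - 1], S[t]] * B[S[t], O[t]]         # a_{s_{t-1} s_t} * b_{s_t}(o_t)
    total += p                                         # accumulates P(O, S | lambda)
print(total)   # P(O | lambda) by eq. (10.32)
```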
To estimate its parameters we can use the EM algorithm; the concrete steps are as follows:

  1. Write down the complete-data log-likelihood.
     Here the observed data are $O = (o_1, o_2, \cdots, o_T)$ and the unobserved data are $S = (s_1, s_2, \cdots, s_T)$, so the complete data are $(O, S) = (o_1, o_2, \cdots, o_T, s_1, s_2, \cdots, s_T)$, and the complete-data log-likelihood is
     $$\log P(O, S \mid \lambda)$$
     Since $P(O, S \mid \lambda) = \pi_{s_1} b_{s_1}(o_1) a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T)$, it follows that
     $$\begin{aligned} \log P(O, S \mid \lambda) &= \log\bigl(\pi_{s_1} b_{s_1}(o_1) a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T)\bigr) \\ &= \log \pi_{s_1} + \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} + \sum_{t=1}^{T} \log b_{s_t}(o_t) \end{aligned} \tag{10.33*}$$

  2. EM E-step: compute the Q function $Q(\lambda, \lambda^{(n)})$:
     $$Q(\lambda, \lambda^{(n)}) = \sum_{S} P(O, S \mid \lambda^{(n)}) \log P(O, S \mid \lambda) \tag{10.33}$$
     Here $\lambda^{(n)}$ is the current estimate of the hidden Markov model parameters (we write the EM iteration as a superscript $(n)$ so it cannot be confused with the time index $t$), and $\lambda$ is the parameter over which the Q function is to be maximized. To simplify the later computation, substitute (10.33*) and rewrite the Q function identically as
     $$Q(\lambda, \lambda^{(n)}) = \sum_{S} P(O, S \mid \lambda^{(n)}) \log \pi_{s_1} + \sum_{S} P(O, S \mid \lambda^{(n)}) \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} + \sum_{S} P(O, S \mid \lambda^{(n)}) \sum_{t=1}^{T} \log b_{s_t}(o_t) \tag{10.34}$$

  3. EM M-step: maximize the Q function $Q(\lambda, \lambda^{(n)})$ to obtain the model parameters $A, B, \pi$.
     (1) Only the first term of (10.34) involves $\pi_{s_1}$, so we maximize that term over $\pi$ to update it (a symbolic check of this constrained maximization is sketched after the derivation):
     $$\begin{aligned} \pi^{(n+1)} &= \arg\max_{\pi} Q(\lambda, \lambda^{(n)}) \\ &= \arg\max_{\pi} \sum_{S} P(O, S \mid \lambda^{(n)}) \log \pi_{s_1} \\ &= \arg\max_{\pi} \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, s_1 = q_1, s_2 = q_2, \cdots, s_T = q_T \mid \lambda^{(n)}) \log \pi_{s_1} \\ &= \arg\max_{\pi} \sum_{q_1} P(O, s_1 = q_1 \mid \lambda^{(n)}) \log \pi_{s_1} \\ &= \arg\max_{\pi} \sum_{i=1}^{N} P(O, s_1 = q_i \mid \lambda^{(n)}) \log \pi_i \end{aligned}$$
     (The states $q_2, \cdots, q_T$ are summed out because $\log \pi_{s_1}$ depends only on $s_1$. There is also an implicit constraint: $\sum_{i=1}^{N} \pi_i = 1$.)
     Apply the method of Lagrange multipliers, first constructing $\delta(\pi, \eta_1)$:
     $$\delta(\pi, \eta_1) = \sum_{i=1}^{N} P(O, s_1 = q_i \mid \lambda^{(n)}) \log \pi_i + \eta_1 \Bigl(\sum_{i=1}^{N} \pi_i - 1\Bigr)$$
     Take the partial derivative with respect to $\pi_i$ and set it to 0:
     $$\frac{\partial \delta}{\partial \pi_i} = \frac{1}{\pi_i} P(O, s_1 = q_i \mid \lambda^{(n)}) + \eta_1 = 0 \tag{10.35}$$
     Multiplying through by $\pi_i$:
     $$P(O, s_1 = q_i \mid \lambda^{(n)}) + \eta_1 \pi_i = 0$$
     Since $\sum_{i=1}^{N} \pi_i = 1$, sum both sides over $i$:
     $$\sum_{i=1}^{N} \bigl[ P(O, s_1 = q_i \mid \lambda^{(n)}) + \eta_1 \pi_i \bigr] = 0$$
     $$\sum_{i=1}^{N} P(O, s_1 = q_i \mid \lambda^{(n)}) + \sum_{i=1}^{N} \eta_1 \pi_i = 0$$
     $$P(O \mid \lambda^{(n)}) + \eta_1 = 0$$
     $$\eta_1 = -P(O \mid \lambda^{(n)}) \tag{10.35*}$$
     Substitute (10.35*) into (10.35):
     $$\frac{1}{\pi_i} P(O, s_1 = q_i \mid \lambda^{(n)}) - P(O \mid \lambda^{(n)}) = 0$$
     $$\pi_i = \frac{P(O, s_1 = q_i \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})} \tag{10.36}$$
     Since $\pi^{(n+1)} = \arg\max_{\pi} Q(\lambda, \lambda^{(n)})$, the updated component is
     $$\pi_i^{(n+1)} = \frac{P(O, s_1 = q_i \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})}$$
     and together these components give the whole updated initial probability vector $\pi$:
     $$\pi^{(n+1)} = \bigl(\pi_1^{(n+1)}, \pi_2^{(n+1)}, \cdots, \pi_N^{(n+1)}\bigr)$$
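As a sanity check on the Lagrange-multiplier step, a short sympy sketch can maximize $\sum_i w_i \log \pi_i$ subject to $\sum_i \pi_i = 1$, with the symbols $w_i$ standing in for $P(O, s_1 = q_i \mid \lambda^{(n)})$; the choice $N = 3$ and the symbol names are arbitrary:

```python
import sympy as sp

# Maximize sum_i w_i * log(pi_i) subject to sum_i pi_i = 1, for N = 3.
w = sp.symbols('w1:4', positive=True)    # w_i stands in for P(O, s_1 = q_i | lambda^(n))
pi = sp.symbols('pi1:4', positive=True)  # the components pi_1, pi_2, pi_3
eta = sp.Symbol('eta')                   # the Lagrange multiplier eta_1

# delta(pi, eta_1) from the text: objective plus eta_1 * (sum_i pi_i - 1)
delta = sum(wi * sp.log(p) for wi, p in zip(w, pi)) + eta * (sum(pi) - 1)
equations = [sp.diff(delta, p) for p in pi] + [sum(pi) - 1]  # stationarity + constraint
sol = sp.solve(equations, list(pi) + [eta], dict=True)[0]

print(sol[pi[0]])   # w1/(w1 + w2 + w3): eq. (10.36), with the total playing P(O | lambda)
print(sol[eta])     # -(w1 + w2 + w3): matches eta_1 = -P(O | lambda^(n)) in (10.35*)
```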

(2) Only the second term of (10.34) involves $a_{ij}$, so we maximize that term over $a_{ij}$ to update it (a vectorized form of the resulting update is sketched below):
$$\begin{aligned} a_{ij}^{(n+1)} &= \arg\max_{a_{ij}} Q(\lambda, \lambda^{(n)}) \\ &= \arg\max_{a_{ij}} \sum_{S} P(O, S \mid \lambda^{(n)}) \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} \\ &= \arg\max_{a_{ij}} \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, s_1 = q_1, s_2 = q_2, \cdots, s_T = q_T \mid \lambda^{(n)}) \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} \\ &= \arg\max_{a_{ij}} \sum_{t=1}^{T-1} \sum_{q_t} \sum_{q_{t+1}} P(O, s_t = q_t, s_{t+1} = q_{t+1} \mid \lambda^{(n)}) \log a_{s_t s_{t+1}} \\ &= \arg\max_{a_{ij}} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) \log a_{ij} \end{aligned}$$
(Each term $\log a_{s_t s_{t+1}}$ depends only on $s_t$ and $s_{t+1}$, so the remaining states are summed out. The implicit constraint here is $\sum_{j=1}^{N} a_{ij} = 1$ for each $i$.)
Apply the method of Lagrange multipliers, first constructing $\delta(a_{ij}, \eta_2)$:
$$\delta(a_{ij}, \eta_2) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) \log a_{ij} + \eta_2 \Bigl(\sum_{j=1}^{N} a_{ij} - 1\Bigr)$$
Take the partial derivative with respect to $a_{ij}$ and set it to 0:
$$\frac{\partial \delta}{\partial a_{ij}} = \frac{1}{a_{ij}} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) + \eta_2 = 0 \tag{10.37}$$
Multiplying through by $a_{ij}$:
$$\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) + \eta_2\, a_{ij} = 0$$
Since $\sum_{j=1}^{N} a_{ij} = 1$, sum both sides over $j$:
$$\sum_{j=1}^{N} \Bigl[ \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) + \eta_2\, a_{ij} \Bigr] = 0$$
$$\sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) + \sum_{j=1}^{N} \eta_2\, a_{ij} = 0$$
$$\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)}) + \eta_2 = 0$$
$$\eta_2 = -\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)}) \tag{10.37*}$$
Substitute (10.37*) into (10.37):
$$\frac{1}{a_{ij}} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) = \sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)})$$
$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)})}{\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)})}$$
Since $a_{ij}^{(n+1)} = \arg\max_{a_{ij}} Q(\lambda, \lambda^{(n)})$, the updated value is
$$a_{ij}^{(n+1)} = \frac{\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)})}{\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)})}$$
and together these entries give the whole updated state transition matrix $A$:
$$A^{(n+1)} = \bigl\{ a_{ij}^{(n+1)} \bigr\}_{N \times N}$$
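In code, if the joint terms are arranged as arrays, say `xi[t, i, j]` proportional to $P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)})$ and `gamma[t, i]` proportional to $P(O, s_t = q_i \mid \lambda^{(n)})$ (hypothetical names; any common normalizing constant cancels in the ratio), the whole update is one vectorized expression:

```python
import numpy as np

T, N = 5, 3
rng = np.random.default_rng(0)
xi = rng.random((T - 1, N, N))   # placeholder values for the T-1 joint terms
gamma = xi.sum(axis=2)           # marginal: sum out the "next state" index j

# Update above: numerator sums xi over t; denominator sums gamma over t = 1..T-1.
A_new = xi.sum(axis=0) / gamma.sum(axis=0)[:, None]
print(A_new.sum(axis=1))         # each row sums to 1, as required of a_ij
```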

(3) Only the third term of (10.34) involves $b_j(k)$, so we maximize that term over $b_j(k)$ to update it (a vectorized form of the resulting update is sketched below):
$$\begin{aligned} b_j(k)^{(n+1)} &= \arg\max_{b_j(k)} Q(\lambda, \lambda^{(n)}) \\ &= \arg\max_{b_j(k)} \sum_{S} P(O, S \mid \lambda^{(n)}) \sum_{t=1}^{T} \log b_{s_t}(o_t) \\ &= \arg\max_{b_j(k)} \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, s_1 = q_1, s_2 = q_2, \cdots, s_T = q_T \mid \lambda^{(n)}) \sum_{t=1}^{T} \log b_{s_t}(o_t) \\ &= \arg\max_{b_j(k)} \sum_{t=1}^{T} \sum_{q_t} P(O, s_t = q_t \mid \lambda^{(n)}) \log b_{q_t}(o_t) \\ &= \arg\max_{b_j(k)} \sum_{j=1}^{N} \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \log b_j(o_t) \end{aligned} \tag{10.38*}$$
At this point you might wonder: we are deriving $b_j(k)$, but only $b_j(o_t)$ appears, so where is $k$?
Consider the following. Because the observation sequence is given, $o_t$ is fixed at each time step; on the other hand, for a state at a given time the probabilities of all possible observations sum to 1, i.e. $\sum_{k=1}^{M} b_j(k) = 1$. Introduce the indicator function $I(o_t = v_k)$, which equals 1 when $o_t = v_k$ and 0 otherwise. Then $\log b_j(o_t)$ can be replaced by $\sum_{k=1}^{M} I(o_t = v_k) \log b_j(k)$, since only the term whose symbol $v_k$ matches $o_t$ survives (this step is worth pausing to think through). With this replacement, (10.38*) becomes
$$b_j(k)^{(n+1)} = \arg\max_{b_j(k)} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{M} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k) \log b_j(k) \tag{10.38**}$$
(Now the constraint $\sum_{k=1}^{M} b_j(k) = 1$ can be imposed for each $j$.)
Apply the method of Lagrange multipliers, first constructing $\delta(b_j(k), \eta_3)$:
$$\delta(b_j(k), \eta_3) = \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{M} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k) \log b_j(k) + \eta_3 \Bigl(\sum_{k=1}^{M} b_j(k) - 1\Bigr)$$
Take the partial derivative with respect to $b_j(k)$ (for one fixed pair $j, k$) and set it to 0:
$$\frac{\partial \delta}{\partial b_j(k)} = \frac{1}{b_j(k)} \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k) + \eta_3 = 0 \tag{10.38}$$
Multiplying through by $b_j(k)$:
$$\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k) + \eta_3\, b_j(k) = 0$$
Since $\sum_{k=1}^{M} b_j(k) = 1$, sum both sides over $k$:
$$\sum_{k=1}^{M} \Bigl[ \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k) + \eta_3\, b_j(k) \Bigr] = 0$$
$$\sum_{k=1}^{M} \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k) + \sum_{k=1}^{M} \eta_3\, b_j(k) = 0$$
Because $\sum_{k=1}^{M} I(o_t = v_k) = 1$ for every $t$, this collapses to
$$\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) + \eta_3 = 0$$
$$\eta_3 = -\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \tag{10.38***}$$
Substitute (10.38***) into (10.38):
$$\frac{1}{b_j(k)} \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k) = \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})$$
$$b_j(k) = \frac{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k)}{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})}$$
Since $b_j(k)^{(n+1)} = \arg\max_{b_j(k)} Q(\lambda, \lambda^{(n)})$, the updated value is
$$b_j(k)^{(n+1)} = \frac{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k)}{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})}$$
and together these entries give the whole updated observation probability matrix $B$:
$$B^{(n+1)} = \bigl\{ b_j^{(n+1)}(k) \bigr\}_{N \times M}$$
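Analogously, a minimal numpy sketch of the $B$ update: with `gamma[t, j]` proportional to $P(O, s_t = q_j \mid \lambda^{(n)})$ and the observations stored as an integer array `O` (hypothetical names, random placeholder values), the indicator sum becomes a boolean mask:

```python
import numpy as np

T, N, M = 6, 3, 2
rng = np.random.default_rng(1)
gamma = rng.random((T, N))            # placeholder for P(O, s_t = q_j | lambda^(n))
O = rng.integers(0, M, size=T)        # observation symbols o_1 .. o_T

# Update above: numerator keeps only the time steps where o_t == v_k.
B_new = np.stack([gamma[O == k].sum(axis=0) for k in range(M)], axis=1)
B_new /= gamma.sum(axis=0)[:, None]   # denominator: sum over all t
print(B_new.sum(axis=1))              # each row sums to 1, as required of b_j(k)
```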

10.3.3 The Baum-Welch Parameter Estimation Formulas

Expressing the probabilities in equations (10.36) through (10.38) in terms of $\gamma_t(i)$ and $\xi_t(i,j)$, the corresponding formulas can be written as:
(1) For $a_{ij}$:
$$a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{10.39}$$
$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)})}{\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)})} \tag{10.39*}$$
Equation (10.39*) is shown for comparison with (10.39): dividing its numerator and denominator by $P(O \mid \lambda^{(n)})$ gives exactly (10.39).

(2) For $b_j(k)$:
$$b_j(k) = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \tag{10.40}$$
$$b_j(k)^{(n+1)} = \frac{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})\, I(o_t = v_k)}{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})} \tag{10.40*}$$
Equation (10.40*) is shown for comparison with (10.40); the indicator $I(o_t = v_k)$ plays the same role as restricting the numerator's sum to the time steps with $o_t = v_k$.

(3) For $\pi_i$:
$$\pi_i = \gamma_1(i) \tag{10.41}$$
$$\pi_i = \frac{P(O, s_1 = q_i \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})} \tag{10.41*}$$
Equation (10.41*) is shown for comparison with (10.41).

(4) To summarize $\gamma_t(i)$ and $\xi_t(i,j)$:
$$\gamma_t(i) = \frac{P(O, s_t = q_i \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})}$$
$$\xi_t(i,j) = \frac{P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})}$$
This is what $\gamma_t(i)$ and $\xi_t(i,j)$ really are; in practice both are computed from the forward and backward variables, as sketched below.
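To see where these quantities come from computationally, here is a minimal sketch (toy parameters of our choosing) that computes $\gamma$ and $\xi$ from the forward variable $\alpha_t(i)$ and the backward variable $\beta_t(i)$ introduced earlier in this series:

```python
import numpy as np

def forward_backward(A, B, pi, O):
    """Return alpha, beta with alpha[t, i] = P(o_1..o_t, s_t = q_i)
    and beta[t, i] = P(o_{t+1}..o_T | s_t = q_i)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]     # forward recursion
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])   # backward recursion
    return alpha, beta

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # toy parameters, invented
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
O = np.array([0, 1, 1, 0])

alpha, beta = forward_backward(A, B, pi, O)
P_O = alpha[-1].sum()                    # P(O | lambda)
gamma = alpha * beta / P_O               # gamma[t, i], per the definition above
# xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
xi = (alpha[:-1, :, None] * A[None] *
      (B[:, O[1:]].T * beta[1:])[:, None, :]) / P_O
print(gamma.sum(axis=1))                 # each row sums to 1, as probabilities over states
```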

Algorithm 10.4 (the Baum-Welch algorithm)
Input: observation data $O = (o_1, o_2, \cdots, o_T)$;
Output: the HMM model parameters $\lambda$.
(1) Initialization. For n = 0, choose $a_{ij}^{(0)}, b_j(k)^{(0)}, \pi_i^{(0)}$, giving the model $\lambda^{(0)} = (A^{(0)}, B^{(0)}, \pi^{(0)})$.
(2) Recursion. For $n = 1, 2, \cdots$, compute (with $\gamma_t$ and $\xi_t$ evaluated under the current model $\lambda^{(n)}$):
$$a_{ij}^{(n+1)} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
$$b_j(k)^{(n+1)} = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
$$\pi_i^{(n+1)} = \gamma_1(i)$$
(3) Termination. This yields the model parameters $\lambda^{(n+1)} = (a_{ij}^{(n+1)}, b_j(k)^{(n+1)}, \pi_i^{(n+1)})$.
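Putting the pieces together, here is a minimal runnable sketch of Algorithm 10.4 for a single observation sequence; the toy data, random initialization, and fixed iteration count are our own choices, and a production implementation would rescale $\alpha, \beta$ or work in log space to avoid underflow:

```python
import numpy as np

def baum_welch(O, N, M, n_iter=50, seed=0):
    """Estimate (A, B, pi) from one observation sequence O (Algorithm 10.4)."""
    rng = np.random.default_rng(seed)
    T = len(O)
    # (1) Initialization: random row-stochastic matrices for lambda^(0).
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(N); pi /= pi.sum()

    for _ in range(n_iter):                     # (2) Recursion
        # E-step: forward-backward under the current model lambda^(n).
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, O[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
        P_O = alpha[-1].sum()
        gamma = alpha * beta / P_O
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, O[1:]].T * beta[1:])[:, None, :]) / P_O

        # M-step: equations (10.39)-(10.41).
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[O == k].sum(axis=0) for k in range(M)],
                     axis=1) / gamma.sum(axis=0)[:, None]
        pi = gamma[0]
    return A, B, pi                             # (3) Termination: lambda^(n+1)

O = np.array([0, 1, 1, 0, 0, 1, 0, 0])          # toy observation sequence
A, B, pi = baum_welch(O, N=2, M=2)
print(A, B, pi, sep="\n")
```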

References

The following are the references for this HMM series of posts:

  1. Li Hang, *Statistical Learning Methods* (《统计学习方法》)
  2. YouTube: shuhuai008's video lectures on HMM
  3. YouTube: 徐亦达 (Yida Xu)'s machine learning lectures on HMM and EM
  4. https://www.huaxiaozhuan.com/%E7%BB%9F%E8%AE%A1%E5%AD%A6%E4%B9%A0/chapters/15_HMM.html : 隐马尔可夫模型 (Hidden Markov Models)
  5. https://sm1les.com/2019/04/10/hidden-markov-model/ : Hidden Markov Models (HMM) and their three basic problems
  6. https://www.cnblogs.com/skyme/p/4651331.html : "一文搞懂HMM(隐马尔可夫模型)" (good for worked examples)
  7. https://www.zhihu.com/question/55974064 : 南屏晚钟's answer on Zhihu

Thanks to the above authors for their contributions to this post; if anything here infringes, the corresponding content will be removed upon request.
