Suppose the training data consists of $S$ observation sequences of equal length together with their corresponding state sequences, $\{(O_1, I_1), (O_2, I_2), \cdots, (O_S, I_S)\}$. Then the parameters of the hidden Markov model can be estimated directly by maximum likelihood, which amounts to frequency counting of initial states, state transitions, and emissions.

However, the state sequences in such training data usually have to be annotated by hand, which is expensive. Unsupervised methods, such as the Baum-Welch algorithm, are therefore more practical.
If only the observation sequence data $O = (o_1, o_2, \cdots, o_T)$ is available, without the corresponding state sequence data $S = (s_1, s_2, \cdots, s_T)$, then the hidden Markov model is a probabilistic model with latent variables:

$$P(O \mid \lambda) = \sum_{S} P(O \mid S, \lambda) P(S \mid \lambda) \tag{10.32}$$
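To make (10.32) concrete, here is a minimal sketch (not from the original text; all parameter values are made-up toy numbers, and the names `joint_prob`, `pi`, `A`, `B`, `obs` are my own) that computes $P(O \mid \lambda)$ for a tiny discrete HMM by brute-force enumeration of every state sequence $S$:

```python
import itertools
import numpy as np

pi = np.array([0.6, 0.4])                # initial state distribution pi_i
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition matrix a_ij
B = np.array([[0.5, 0.5], [0.1, 0.9]])   # emission matrix b_j(k)
obs = [0, 1, 0]                          # observation sequence O (indices into B's columns)

def joint_prob(states, obs, pi, A, B):
    """P(O, S | lambda) = pi_{s1} b_{s1}(o1) a_{s1 s2} b_{s2}(o2) ... b_{sT}(oT)."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

# P(O | lambda): sum P(O, S | lambda) over all N^T state sequences, as in (10.32)
N, T = len(pi), len(obs)
p_O = sum(joint_prob(s, obs, pi, A, B)
          for s in itertools.product(range(N), repeat=T))
print(p_O)  # marginal likelihood of the observations
```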
If we want to estimate its parameters, the EM algorithm can be used, with the following steps:

Determine the log-likelihood function of the complete data
Here the observed data is $O = (o_1, o_2, \cdots, o_T)$ and the unobserved data is $S = (s_1, s_2, \cdots, s_T)$, so the complete data is $(O, S) = (o_1, o_2, \cdots, o_T, s_1, s_2, \cdots, s_T)$, and the log-likelihood function of the complete data is:

$$\log P(O, S \mid \lambda)$$

where $P(O, S \mid \lambda) = \pi_{s_1} b_{s_1}(o_1) a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T)$, from which we obtain:

$$\begin{aligned} \log P(O, S \mid \lambda) &= \log\big(\pi_{s_1} b_{s_1}(o_1) a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T)\big) \\ &= \log \pi_{s_1} + \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} + \sum_{t=1}^{T} \log b_{s_t}(o_t) \end{aligned} \tag{10.33*}$$
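As a small sanity check of the decomposition (10.33*), the following snippet (reusing the toy model and `joint_prob` from the sketch above; the state sequence chosen here is arbitrary) verifies that the log of the factored joint probability equals the three-term sum:

```python
# a hypothetical state sequence S for the toy obs above
states = [0, 1, 1]
lhs = np.log(joint_prob(states, obs, pi, A, B))
rhs = (np.log(pi[states[0]])                                           # log pi_{s1}
       + sum(np.log(A[states[t], states[t + 1]]) for t in range(T - 1))  # transition terms
       + sum(np.log(B[states[t], obs[t]]) for t in range(T)))             # emission terms
assert np.isclose(lhs, rhs)  # (10.33*) holds
```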
E-step of the EM algorithm: compute the Q-function $Q(\lambda, \lambda^{(n)})$

$$Q(\lambda, \lambda^{(n)}) = \sum_{S} P(O, S \mid \lambda^{(n)}) \log P(O, S \mid \lambda) \tag{10.33}$$

where $\lambda^{(n)}$ is the current estimate of the model parameters and $\lambda$ is the model parameter to be maximized (the iteration index is written as $n$ to avoid a clash with the time index $t$). To simplify later computations, substitute (10.33*) to rewrite the Q-function identically as:

$$Q(\lambda, \lambda^{(n)}) = \sum_{S} P(O, S \mid \lambda^{(n)}) \log \pi_{s_1} + \sum_{S} P(O, S \mid \lambda^{(n)}) \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} + \sum_{S} P(O, S \mid \lambda^{(n)}) \sum_{t=1}^{T} \log b_{s_t}(o_t) \tag{10.34}$$
M-step of the EM algorithm: maximize the Q-function $Q(\lambda, \lambda^{(n)})$ to obtain the model parameters $A, B, \pi$

(1) Only the first term of (10.34) involves $\pi_{s_1}$, so maximize that term with respect to $\pi$ to update its value. The derivation is as follows:
$$\begin{aligned} \pi^{(n+1)} &= \arg\max_{\pi} Q(\lambda, \lambda^{(n)}) \\ &= \arg\max_{\pi} \sum_{S} P(O, S \mid \lambda^{(n)}) \log \pi_{s_1} \\ &= \arg\max_{\pi} \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, s_1 = q_1, s_2 = q_2, \cdots, s_T = q_T \mid \lambda^{(n)}) \log \pi_{q_1} \\ &= \arg\max_{\pi} \sum_{q_1} P(O, s_1 = q_1 \mid \lambda^{(n)}) \log \pi_{q_1} \\ &= \arg\max_{\pi} \sum_{i=1}^{N} P(O, s_1 = q_i \mid \lambda^{(n)}) \log \pi_i \end{aligned}$$
(with the implicit constraint $\sum_{i=1}^{N} \pi_i = 1$)
Using the method of Lagrange multipliers, first construct $\delta(\pi, \eta_1)$:

$$\delta(\pi, \eta_1) = \sum_{i=1}^{N} P(O, s_1 = q_i \mid \lambda^{(n)}) \log \pi_i + \eta_1 \Big( \sum_{i=1}^{N} \pi_i - 1 \Big)$$

Take the partial derivative with respect to $\pi_i$ and set it to zero:

$$\frac{\partial \delta}{\partial \pi_i} = \frac{1}{\pi_i} P(O, s_1 = q_i \mid \lambda^{(n)}) + \eta_1 = 0 \tag{10.35}$$

$$P(O, s_1 = q_i \mid \lambda^{(n)}) + \eta_1 \pi_i = 0$$
Since $\sum_{i=1}^{N} \pi_i = 1$, sum both sides over $i$:

$$\sum_{i=1}^{N} \Big[ P(O, s_1 = q_i \mid \lambda^{(n)}) + \eta_1 \pi_i \Big] = 0$$

$$\sum_{i=1}^{N} P(O, s_1 = q_i \mid \lambda^{(n)}) + \sum_{i=1}^{N} \eta_1 \pi_i = 0$$

$$P(O \mid \lambda^{(n)}) + \eta_1 = 0$$

$$\eta_1 = -P(O \mid \lambda^{(n)}) \tag{10.35*}$$
Substitute (10.35*) into (10.35):

$$\frac{1}{\pi_i} P(O, s_1 = q_i \mid \lambda^{(n)}) - P(O \mid \lambda^{(n)}) = 0$$

$$\pi_i = \frac{P(O, s_1 = q_i \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})} \tag{10.36}$$
Since $\pi^{(n+1)} = \arg\max_{\pi} Q(\lambda, \lambda^{(n)})$, the updated components are:

$$\pi_i^{(n+1)} = \frac{P(O, s_1 = q_i \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})}$$

which together give the updated initial state probability vector $\pi$:

$$\pi^{(n+1)} = \big( \pi_1^{(n+1)}, \pi_2^{(n+1)}, \cdots, \pi_N^{(n+1)} \big)$$
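A sketch of this update, continuing the toy example above (brute-force enumeration is fine for tiny $N$ and $T$; `marginal_first_state` is my own helper name): the numerator $P(O, s_1 = q_i \mid \lambda^{(n)})$ is obtained by summing the joint probability over all state sequences whose first state is $q_i$:

```python
def marginal_first_state(i, obs, pi, A, B):
    """P(O, s_1 = q_i | lambda): sum the joint over sequences starting in state i."""
    return sum(joint_prob(s, obs, pi, A, B)
               for s in itertools.product(range(N), repeat=T) if s[0] == i)

pi_new = np.array([marginal_first_state(i, obs, pi, A, B) for i in range(N)])
pi_new /= p_O                 # divide by P(O | lambda), as in (10.36)
print(pi_new, pi_new.sum())   # a valid distribution: components sum to 1
```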
(2) Only the second term of (10.34) involves $a_{ij}$, so maximize that term with respect to $a_{ij}$ to update its value. The derivation is as follows:

$$\begin{aligned} a_{ij}^{(n+1)} &= \arg\max_{a_{ij}} Q(\lambda, \lambda^{(n)}) \\ &= \arg\max_{a_{ij}} \sum_{S} P(O, S \mid \lambda^{(n)}) \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} \\ &= \arg\max_{a_{ij}} \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, s_1 = q_1, s_2 = q_2, \cdots, s_T = q_T \mid \lambda^{(n)}) \sum_{t=1}^{T-1} \log a_{s_t s_{t+1}} \\ &= \arg\max_{a_{ij}} \sum_{t=1}^{T-1} \sum_{q_t} \sum_{q_{t+1}} P(O, s_t = q_t, s_{t+1} = q_{t+1} \mid \lambda^{(n)}) \log a_{q_t q_{t+1}} \\ &= \arg\max_{a_{ij}} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) \log a_{ij} \end{aligned}$$

(with the implicit constraint $\sum_{j=1}^{N} a_{ij} = 1$ for each $i$)
Using the method of Lagrange multipliers, first construct $\delta(a_{ij}, \eta_2)$:

$$\delta(a_{ij}, \eta_2) = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) \log a_{ij} + \eta_2 \Big( \sum_{j=1}^{N} a_{ij} - 1 \Big)$$

Take the partial derivative with respect to $a_{ij}$ and set it to zero:

$$\frac{\partial \delta}{\partial a_{ij}} = \frac{1}{a_{ij}} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) + \eta_2 = 0 \tag{10.37}$$

$$\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) + \eta_2 \, a_{ij} = 0$$
Since $\sum_{j=1}^{N} a_{ij} = 1$, sum both sides over $j$:

$$\sum_{j=1}^{N} \Big[ \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) + \eta_2 \, a_{ij} \Big] = 0$$

$$\sum_{j=1}^{N} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) + \sum_{j=1}^{N} \eta_2 \, a_{ij} = 0$$

$$\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)}) + \eta_2 = 0$$

$$\eta_2 = -\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)}) \tag{10.37*}$$
Substitute (10.37*) into (10.37):

$$\frac{1}{a_{ij}} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)}) = \sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)})$$

$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)})}{\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)})}$$
Since $a_{ij}^{(n+1)} = \arg\max_{a_{ij}} Q(\lambda, \lambda^{(n)})$, the updated $a_{ij}^{(n+1)}$ is:

$$a_{ij}^{(n+1)} = \frac{\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)})}{\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)})}$$

which gives the updated state transition matrix $A$:

$$A^{(n+1)} = \big[ a_{ij}^{(n+1)} \big]_{N \times N}$$
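A sketch of the $a_{ij}$ update by the same brute-force route (`pair_marginal` is my own helper; note that $\sum_{j} \sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda) = \sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda)$, which supplies the denominator):

```python
def pair_marginal(i, j, t, obs, pi, A, B):
    """P(O, s_t = q_i, s_{t+1} = q_j | lambda); t is 0-indexed here."""
    return sum(joint_prob(s, obs, pi, A, B)
               for s in itertools.product(range(N), repeat=T)
               if s[t] == i and s[t + 1] == j)

A_new = np.zeros((N, N))
for i in range(N):
    denom = sum(pair_marginal(i, j, t, obs, pi, A, B)
                for j in range(N) for t in range(T - 1))
    for j in range(N):
        A_new[i, j] = sum(pair_marginal(i, j, t, obs, pi, A, B)
                          for t in range(T - 1)) / denom
print(A_new, A_new.sum(axis=1))  # each row sums to 1
```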
(3) Only the third term of (10.34) involves $b_j(k)$, so maximize that term with respect to $b_j(k)$ to update its value. The derivation is as follows:

$$\begin{aligned} b_j(k)^{(n+1)} &= \arg\max_{b_j(k)} Q(\lambda, \lambda^{(n)}) \\ &= \arg\max_{b_j(k)} \sum_{S} P(O, S \mid \lambda^{(n)}) \sum_{t=1}^{T} \log b_{s_t}(o_t) \\ &= \arg\max_{b_j(k)} \sum_{q_1} \sum_{q_2} \cdots \sum_{q_T} P(O, s_1 = q_1, s_2 = q_2, \cdots, s_T = q_T \mid \lambda^{(n)}) \sum_{t=1}^{T} \log b_{s_t}(o_t) \\ &= \arg\max_{b_j(k)} \sum_{t=1}^{T} \sum_{q_t} P(O, s_t = q_t \mid \lambda^{(n)}) \log b_{q_t}(o_t) \\ &= \arg\max_{b_j(k)} \sum_{j=1}^{N} \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \log b_j(o_t) \end{aligned} \tag{10.38*}$$
At this point you may have a question: we are deriving $b_j(k)$, yet only $b_j(o_t)$ appears so far. Where is $k$?

The key is that the observation sequence is given, so at each time step exactly one observation $o_t$ is realized, while the emission probabilities of a state over all possible observations sum to one: $\sum_{k=1}^{M} b_{s_t}(k) = 1$. Introduce the indicator function $I(o_t = v_k)$, which equals 1 when $o_t = v_k$ and 0 otherwise. Then $b_j(o_t)$ can be rewritten in terms of $b_j(k)$ and the indicator (here the state $s_t$ has already been replaced by the concrete state $q_j$; the identity below is worth pausing over), and (10.38*) becomes:
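The rewrite rests on a one-line identity: exactly one $v_k$ equals the realized observation $o_t$, so summing over $k$ with the indicator picks out precisely that term:

$$\log b_j(o_t) = \sum_{k=1}^{M} I(o_t = v_k) \log b_j(k)$$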
$$b_j(k)^{(n+1)} = \arg\max_{b_j(k)} \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{M} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k) \log b_j(k) \tag{10.38**}$$

(now the constraint $\sum_{k=1}^{M} b_j(k) = 1$ can be imposed)
Using the method of Lagrange multipliers, first construct $\delta(b_j(k), \eta_3)$:

$$\delta(b_j(k), \eta_3) = \sum_{j=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{M} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k) \log b_j(k) + \eta_3 \Big( \sum_{k=1}^{M} b_j(k) - 1 \Big)$$

Take the partial derivative with respect to $b_j(k)$ (for fixed $j$ and $k$) and set it to zero:

$$\frac{\partial \delta}{\partial b_j(k)} = \frac{1}{b_j(k)} \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k) + \eta_3 = 0 \tag{10.38}$$

$$\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k) + \eta_3 \, b_j(k) = 0$$
Since $\sum_{k=1}^{M} b_j(k) = 1$, sum both sides over $k$ (using $\sum_{k=1}^{M} I(o_t = v_k) = 1$):

$$\sum_{k=1}^{M} \Big[ \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k) + \eta_3 \, b_j(k) \Big] = 0$$

$$\sum_{k=1}^{M} \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k) + \sum_{k=1}^{M} \eta_3 \, b_j(k) = 0$$

$$\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) + \eta_3 = 0$$

$$\eta_3 = -\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \tag{10.38***}$$
Substitute (10.38***) into (10.38):

$$\frac{1}{b_j(k)} \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k) = \sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})$$

$$b_j(k) = \frac{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k)}{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})}$$
Since $b_j(k)^{(n+1)} = \arg\max_{b_j(k)} Q(\lambda, \lambda^{(n)})$, the updated $b_j(k)^{(n+1)}$ is:

$$b_j(k)^{(n+1)} = \frac{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k)}{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})}$$

which gives the updated observation probability matrix $B$:

$$B^{(n+1)} = \big[ b_j^{(n+1)}(k) \big]_{N \times M}$$
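A sketch of the $b_j(k)$ update in the same brute-force style (`state_marginal` and `M_obs` are my own names): the indicator $I(o_t = v_k)$ simply restricts the numerator's sum over $t$ to the time steps at which $v_k$ was actually observed:

```python
def state_marginal(j, t, obs, pi, A, B):
    """P(O, s_t = q_j | lambda); t is 0-indexed here."""
    return sum(joint_prob(s, obs, pi, A, B)
               for s in itertools.product(range(N), repeat=T) if s[t] == j)

M_obs = B.shape[1]  # number of distinct observation symbols
B_new = np.zeros((N, M_obs))
for j in range(N):
    denom = sum(state_marginal(j, t, obs, pi, A, B) for t in range(T))
    for k in range(M_obs):
        B_new[j, k] = sum(state_marginal(j, t, obs, pi, A, B)
                          for t in range(T) if obs[t] == k) / denom
print(B_new, B_new.sum(axis=1))  # each row sums to 1
```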
Expressing the probabilities in equations (10.36) to (10.38) in terms of $\gamma_t(i)$ and $\xi_t(i,j)$, the corresponding formulas can be written as:

(1) For $a_{ij}$:

$$a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{10.39}$$

$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)})}{\sum_{t=1}^{T-1} P(O, s_t = q_i \mid \lambda^{(n)})} \tag{10.39*}$$

((10.39*) is shown for comparison with (10.39); the common factor $P(O \mid \lambda^{(n)})$ cancels between numerator and denominator.)
(2) For $b_j(k)$:

$$b_j(k) = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \tag{10.40}$$

$$b_j(k) = \frac{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)}) \, I(o_t = v_k)}{\sum_{t=1}^{T} P(O, s_t = q_j \mid \lambda^{(n)})} \tag{10.40*}$$

((10.40*) is shown for comparison with (10.40).)
(3) For $\pi_i$:

$$\pi_i = \gamma_1(i) \tag{10.41}$$

$$\pi_i = \frac{P(O, s_1 = q_i \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})} \tag{10.41*}$$

((10.41*) is shown for comparison with (10.41).)
(4) To summarize $\gamma_t(i)$ and $\xi_t(i,j)$:

$$\gamma_t(i) = \frac{P(O, s_t = q_i \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})}$$

$$\xi_t(i,j) = \frac{P(O, s_t = q_i, s_{t+1} = q_j \mid \lambda^{(n)})}{P(O \mid \lambda^{(n)})}$$

This is what $\gamma_t(i)$ and $\xi_t(i,j)$ really are: the posterior probabilities $P(s_t = q_i \mid O, \lambda^{(n)})$ and $P(s_t = q_i, s_{t+1} = q_j \mid O, \lambda^{(n)})$ of the hidden states given the observations.
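In practice these quantities are not obtained by enumeration but by the forward-backward algorithm: $\gamma_t(i) = \alpha_t(i)\beta_t(i)/P(O \mid \lambda)$ and $\xi_t(i,j) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)/P(O \mid \lambda)$. Below is a minimal unscaled sketch (`forward_backward` is my own name; real implementations should rescale $\alpha, \beta$ or work in log space to avoid underflow):

```python
def forward_backward(obs, pi, A, B):
    """Return gamma (T, N), xi (T-1, N, N), and P(O | lambda) for a discrete HMM."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0                                  # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[T - 1].sum()                         # P(O | lambda)
    gamma = alpha * beta / p_obs
    # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs
    return gamma, xi, p_obs

gamma, xi, p_fb = forward_backward(obs, pi, A, B)
assert np.isclose(p_fb, p_O)  # matches the brute-force enumeration above
```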
Algorithm 10.4 (the Baum-Welch algorithm)

Input: observation data $O = (o_1, o_2, \cdots, o_T)$

Output: the HMM model parameters $\lambda$

(1) Initialization. For $n = 0$, choose $a_{ij}^{(0)}, b_j(k)^{(0)}, \pi_i^{(0)}$, which give the model $\lambda^{(0)} = (A^{(0)}, B^{(0)}, \pi^{(0)})$.

(2) Recursion. For $n = 1, 2, \cdots$, with $\gamma_t$ and $\xi_t$ computed under the current model $\lambda^{(n)}$:
$$a_{ij}^{(n+1)} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

$$b_j(k)^{(n+1)} = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$

$$\pi_i^{(n+1)} = \gamma_1(i)$$

(3) Termination. Obtain the model parameters $\lambda^{(n+1)} = (A^{(n+1)}, B^{(n+1)}, \pi^{(n+1)})$.
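Putting Algorithm 10.4 together as a sketch (toy-scale, still without log-space scaling; it reuses `forward_backward` from above, and the convergence test on the log-likelihood is my own addition, not part of the algorithm as stated):

```python
def baum_welch(obs, pi, A, B, n_iter=100, tol=1e-8):
    """Iterate E-step (forward-backward) and M-step (10.39)-(10.41)."""
    obs = np.asarray(obs)
    prev_ll = -np.inf
    for _ in range(n_iter):
        gamma, xi, p_obs = forward_backward(obs, pi, A, B)      # E-step
        pi = gamma[0]                                           # (10.41)
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]    # (10.39)
        B = np.zeros_like(B)
        for k in range(B.shape[1]):                             # (10.40)
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
        if np.log(p_obs) - prev_ll < tol:                       # likelihood stopped improving
            break
        prev_ll = np.log(p_obs)
    return pi, A, B

pi_hat, A_hat, B_hat = baum_welch(obs, pi, A, B)  # toy run on the example above
```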