If there are any writing, formatting, or conceptual errors, I hope readers will kindly point them out.
Due to length limits, this topic is split into six posts.
【Natural Language Processing】Hidden Markov Model【Ⅰ】Markov Models
【Natural Language Processing】Hidden Markov Model【Ⅱ】Overview of the Hidden Markov Model
【Natural Language Processing】Hidden Markov Model【Ⅲ】The Estimation Problem
【Natural Language Processing】Hidden Markov Model【Ⅳ】The Learning Problem
【Natural Language Processing】Hidden Markov Model【Ⅴ】The Decoding Problem
【Natural Language Processing】Hidden Markov Model【Ⅵ】The Precision Problem
Learning in hidden Markov models can be carried out by supervised or unsupervised learning, depending on whether the training data contain both the observation sequences and the corresponding state sequences, or only the observation sequences. Supervised learning is simple and easy to implement, but it requires labeled training data, and manual annotation is often expensive, so unsupervised learning is sometimes used instead. For HMMs, unsupervised learning means the Baum-Welch algorithm, which is essentially the EM algorithm; it carries its own name only because, when Baum and Welch proposed it, this class of iterative optimization procedures had not yet been unified under the name EM.
Suppose the training data consist of $D$ observation sequences of the same length together with the corresponding state sequences, $\{(O_1,I_1),(O_2,I_2),\dots,(O_D,I_D)\}$. When the dataset is large enough, the relative frequencies can be used directly as probability estimates.
Estimating the transition probability $a_{ij}$. Let $A_{ij}$ denote the number of times in the sample that the chain is in state $q_i$ at time $t$ and moves to state $q_j$ at time $t+1$. The estimate of the transition probability $a_{ij}$ is then
$$\hat a_{ij} = \frac{A_{ij}}{\sum\limits_{j=1}^N A_{ij}}, \qquad 1\le i,j\le N$$
Estimating the observation probability $b_j(k)$. Let $B_{jk}$ denote the number of times in the sample that the state is $q_j$ and the observation is $v_k$. The estimate of the probability $b_j(k)$ of observing $v_k$ while in state $q_j$ is then
$$\hat b_j(k) = \frac{B_{jk}}{\sum\limits_{k=1}^M B_{jk}}, \qquad 1\le k\le M;\ 1\le j\le N$$
Estimating the initial state probability $\pi_i$. Let $C_i$ denote the number of sequences in the sample whose state at time $t=1$ is $q_i$. The estimate of the initial state probability $\pi_i$ is then
$$\hat\pi_i = \frac{C_i}{\sum\limits_{j=1}^{N}C_j}$$
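To make the counting concrete, here is a minimal sketch of the supervised estimates $\hat a_{ij}$, $\hat b_j(k)$, and $\hat\pi_i$, assuming NumPy and integer-encoded states ($0,\dots,N-1$) and observations ($0,\dots,M-1$); the function and variable names are illustrative only:

```python
import numpy as np

def supervised_estimate(obs_seqs, state_seqs, N, M):
    """Frequency-based estimates of (A, B, pi) from labeled (O_d, I_d) pairs."""
    A_cnt = np.zeros((N, N))   # A_cnt[i, j]: number of i -> j transitions
    B_cnt = np.zeros((N, M))   # B_cnt[j, k]: number of times state j emits symbol k
    pi_cnt = np.zeros(N)       # pi_cnt[i]: number of sequences starting in state i

    for O, I in zip(obs_seqs, state_seqs):
        pi_cnt[I[0]] += 1
        for t in range(len(I) - 1):
            A_cnt[I[t], I[t + 1]] += 1
        for t in range(len(I)):
            B_cnt[I[t], O[t]] += 1

    # Normalize counts row-wise to turn frequencies into probability estimates.
    A_hat = A_cnt / A_cnt.sum(axis=1, keepdims=True)
    B_hat = B_cnt / B_cnt.sum(axis=1, keepdims=True)
    pi_hat = pi_cnt / pi_cnt.sum()
    return A_hat, B_hat, pi_hat
```

If some state never occurs in the labeled data, the corresponding row sum is zero; in practice a small amount of smoothing (e.g., adding one to every count) avoids the resulting division by zero.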
Now suppose the training data consist of only $D$ observation sequences of length $T$, $\{O_1,O_2,\dots,O_D\}$, with no corresponding state sequences, where $O_d=\{o_1^{(d)},o_2^{(d)},\dots,o_T^{(d)}\}$ and the (hidden) state sequence is written $S_d=\{s_1^{(d)},s_2^{(d)},\dots,s_T^{(d)}\}$. The goal is to use these $D$ sequences to learn the parameters of a single hidden Markov model $\lambda=(A,B,\pi)$.
By Eq. $(3)$, the joint probability $P(O,S\mid\lambda)$ of the independent observation sequences is
$$P(O,S\mid \lambda) = \prod_{d=1}^D \left(\pi_{s_1^{(d)}}\prod_{t=1}^T b_{s_t^{(d)}}(o_t^{(d)})\prod_{t=1}^{T-1} a_{s_{t}^{(d)}s_{t+1}^{(d)}} \right) \tag{19}$$
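As a quick check of Eq. $(19)$, the following sketch evaluates $\log P(O,S\mid\lambda)$ for $D$ integer-encoded sequence pairs (log domain to avoid underflow; the names are illustrative, not part of the original derivation):

```python
import numpy as np

def joint_log_prob(obs_seqs, state_seqs, A, B, pi):
    """log P(O, S | lambda) over D independent (O_d, S_d) pairs, as in Eq. (19)."""
    logp = 0.0
    for O, S in zip(obs_seqs, state_seqs):
        O, S = np.asarray(O), np.asarray(S)
        logp += np.log(pi[S[0]])                 # initial-state term
        logp += np.log(B[S, O]).sum()            # emission terms, t = 1..T
        logp += np.log(A[S[:-1], S[1:]]).sum()   # transition terms, t = 1..T-1
    return logp
```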
In the E step of the EM algorithm, we compute the $Q$ function $Q(\lambda, \bar \lambda)$:
$$Q(\lambda,\bar \lambda) = \sum_{S} P(S\mid O,\bar \lambda) \log P(O,S\mid \lambda)$$
where $\bar \lambda$ is the current estimate of the HMM parameters and $\lambda$ is the set of HMM parameters to be maximized over.
In the M step of the EM algorithm, the $Q$ function $Q(\lambda, \bar \lambda)$ is maximized to obtain the model parameters $A$, $B$, $\pi$:
$$\bar \lambda = \arg\max_{\lambda} \sum_{S}P(S\mid O,\bar \lambda) \log P(O,S\mid \lambda)$$
Since $P(S\mid O,\bar \lambda) = P(O,S\mid \bar \lambda)/P(O\mid \bar \lambda)$, and $P(O\mid \bar \lambda)$ is a constant that does not depend on $\lambda$, the maximization of the $Q$ function can be rewritten as
$$\bar \lambda = \arg\max_\lambda \sum_{S} P(O,S\mid \bar \lambda)\log P(O,S\mid \lambda)$$
Substituting Eq. $(19)$ gives
$$\begin{aligned} \bar{\lambda} &= \arg\max_{\lambda}\sum\limits_{d=1}^D\sum\limits_{S}P(O,S\mid \bar{\lambda})\Big(\log\pi_{s_1^{(d)}} + \sum\limits_{t=1}^{T-1}\log a_{s_t^{(d)},s_{t+1}^{(d)}} + \sum\limits_{t=1}^T\log b_{s_t^{(d)}}(o_t^{(d)})\Big) \\ &=\arg\max_{\lambda}\left[\sum\limits_{d=1}^D\sum\limits_{S}P(O,S\mid \bar{\lambda})\log\pi_{s_1^{(d)}} + \sum\limits_{d=1}^D\sum\limits_{S} \left(\sum\limits_{t=1}^{T-1}\log a_{s_t^{(d)},s_{t+1}^{(d)}}\right) P(O,S\mid \bar{\lambda}) + \sum\limits_{d=1}^D\sum\limits_{S}\left(\sum\limits_{t=1}^T\log b_{s_t^{(d)}}(o_t^{(d)})\right)P(O,S\mid \bar{\lambda})\right] \end{aligned} \tag{20}$$
Since the parameters $A$, $B$, and $\pi$ to be maximized each appear in exactly one of the three terms of Eq. $(20)$, it suffices to maximize each term separately.
The first term of Eq. $(20)$ can be written as
$$\sum_{d=1}^D\sum_{S} P(O,S\mid \bar \lambda) \log \pi_{s_1^{(d)}} = \sum_{d=1}^D \sum_{i=1}^N P(O,s_1^{(d)}=q_i\mid \bar \lambda) \log \pi_i$$
Noting that $\pi_i$ satisfies the constraint $\sum\limits_{i=1}^N \pi_i = 1$, we use the method of Lagrange multipliers and write the Lagrangian
$$\sum_{d=1}^D \sum_{i=1}^N P(O,s_1^{(d)}=q_i\mid \bar \lambda) \log \pi_i + \mu \left( \sum_{i=1}^N \pi_i - 1 \right)$$
Taking the partial derivative with respect to $\pi_i$ and setting it to zero,
$$\frac{\partial }{\partial \pi_i}\left[ \sum_{d=1}^D \sum_{i=1}^N P(O,s_1^{(d)}=q_i\mid \bar \lambda) \log \pi_i + \mu \left( \sum_{i=1}^N \pi_i - 1 \right) \right] = 0$$
we obtain
$$\sum\limits_{d=1}^D P(O,s_1^{(d)} =q_i\mid \bar{\lambda}) + \mu\pi_i = 0$$
Summing over $i$ gives $\mu$:
$$\mu = -\sum_{d=1}^D P(O\mid \bar \lambda)$$
Substituting this back into the zero-derivative equation to eliminate $\mu$ yields $\pi_i$:
$$\pi_i =\frac{\sum\limits_{d=1}^D P(O,s_1^{(d)} =q_i\mid\bar{\lambda})}{\sum\limits_{d=1}^D P(O\mid\bar{\lambda})} = \frac{\sum\limits_{d=1}^D P(O,s_1^{(d)} =q_i\mid\bar{\lambda})}{D\,P(O\mid\bar{\lambda})} = \frac{\sum\limits_{d=1}^D P(s_1^{(d)} =q_i\mid O, \bar{\lambda})}{D} = \frac{\sum\limits_{d=1}^D P(s_1^{(d)} =q_i\mid O_{d}, \bar{\lambda})}{D}$$
By Eq. $(15)$, $\gamma_1^{(d)}(i) = P(s_1^{(d)}=q_i\mid O_{d},\bar \lambda)$, so the expression above becomes
$$\pi_i = \frac{\sum\limits_{d=1}^D \gamma_1^{(d)}(i)}{D}=\frac{1}{D}\sum_{d=1}^D\frac{\alpha_1^{(d)}(i)\, \beta^{(d)}_1(i)}{P_d} \tag{21}$$
In essence, $P_d=P(O_d\mid \lambda)$; here, however, $P_d=P(O_d\mid \bar \lambda)$, i.e., it is evaluated under the current parameter estimate.
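Assuming the forward variables $\alpha_t^{(d)}(i)$ and backward variables $\beta_t^{(d)}(i)$ from post Ⅲ have already been computed under $\bar\lambda$, the update $(21)$ is straightforward to code; a minimal sketch with illustrative names:

```python
import numpy as np

def update_pi(alphas, betas):
    """Eq. (21): pi_i = (1/D) * sum_d alpha_1^{(d)}(i) * beta_1^{(d)}(i) / P_d.

    alphas[d], betas[d] are (T, N) arrays for sequence d, computed under the
    current estimate lambda_bar; P_d = sum_i alpha_T^{(d)}(i) = P(O_d | lambda_bar).
    """
    D = len(alphas)
    pi = np.zeros(alphas[0].shape[1])
    for alpha, beta in zip(alphas, betas):
        P_d = alpha[-1].sum()              # P(O_d | lambda_bar)
        pi += alpha[0] * beta[0] / P_d     # gamma_1^{(d)}(i)
    return pi / D
```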
The second term of Eq. $(20)$ can be written as
$$\sum\limits_{d=1}^D\sum\limits_{S} \left(\sum\limits_{t=1}^{T-1}\log a_{s_t^{(d)},s_{t+1}^{(d)}}\right) P(O,S\mid \bar{\lambda}) = \sum\limits_{d=1}^D\sum\limits_{S}\sum\limits_{t=1}^{T-1}P(O,S\mid \bar{\lambda})\log a_{s_t^{(d)},s_{t+1}^{(d)}} = \sum\limits_{d=1}^D\sum\limits_{i=1}^N\sum\limits_{j=1}^N\sum\limits_{t=1}^{T-1}P(O,s_t^{(d)} = q_i, s_{t+1}^{(d)} = q_j\mid\bar{\lambda})\log a_{ij}$$
Since $a_{ij}$ satisfies the constraint $\sum\limits_{j=1}^N a_{ij} = 1$, we can proceed as for $\pi_i$: apply the Lagrange multiplier method, take the partial derivative with respect to $a_{ij}$, and set it to zero, which finally gives
$$a_{ij} = \frac{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T-1}P(O_{d}, s_t^{(d)} = q_i, s_{t+1}^{(d)} = q_j\mid\bar{\lambda})}{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T-1}P(O_d, s_t^{(d)} = q_i\mid \bar{\lambda})}$$
Using Eqs. $(15)$ and $(17)$, this becomes
$$a_{ij} = \frac{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T-1}\xi_t^{(d)}(i,j)}{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T-1}\gamma_t^{(d)}(i)} = \frac{\sum\limits_{d=1}^D\frac{1}{P_d}\sum\limits_{t=1}^{T-1} \alpha^{(d)}_t(i)\,a_{ij}\,b_j(o_{t+1}^{(d)})\,\beta^{(d)}_{t+1}(j) }{\sum\limits_{d=1}^D\frac{1}{P_d}\sum\limits_{t=1}^{T-1} \alpha^{(d)}_t(i)\, \beta^{(d)}_t(i) }\tag{22}$$
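Under the same assumptions (precomputed $\alpha^{(d)}$, $\beta^{(d)}$ and the current estimates $\bar A$, $\bar B$), Eq. $(22)$ can be sketched as follows; the names are again illustrative:

```python
import numpy as np

def update_A(obs_seqs, alphas, betas, A_bar, B_bar):
    """Eq. (22): numerator sums xi_t^{(d)}(i, j), denominator sums gamma_t^{(d)}(i)."""
    N = A_bar.shape[0]
    num = np.zeros((N, N))
    den = np.zeros(N)
    for O, alpha, beta in zip(obs_seqs, alphas, betas):
        P_d = alpha[-1].sum()
        for t in range(len(O) - 1):
            # xi_t^{(d)}(i, j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / P_d
            num += np.outer(alpha[t], B_bar[:, O[t + 1]] * beta[t + 1]) * A_bar / P_d
            # gamma_t^{(d)}(i) = alpha_t(i) * beta_t(i) / P_d, summed over t = 1..T-1
            den += alpha[t] * beta[t] / P_d
    return num / den[:, None]
```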
The third term of Eq. $(20)$ can be written as
$$\sum\limits_{d=1}^D\sum\limits_{S}\left(\sum\limits_{t=1}^T\log b_{s_t^{(d)}}(o_t^{(d)})\right)P(O,S\mid \bar{\lambda}) = \sum\limits_{d=1}^D\sum\limits_{S}\sum\limits_{t=1}^{T}P(O,S\mid\bar{\lambda})\log b_{s_t^{(d)}}(o_t^{(d)}) = \sum\limits_{d=1}^D\sum\limits_{j=1}^N\sum\limits_{t=1}^{T}P(O,s_t^{(d)} = q_j\mid \bar{\lambda})\log b_{j}(o_t^{(d)})$$
Since $b_j(k)$ satisfies the constraint $\sum\limits_{k=1}^M b_j(k)=1$, we again apply the Lagrange multiplier method as for $\pi_i$, taking the partial derivative with respect to $b_j(k)$ and setting it to zero. Note that the partial derivative of $b_j(o_t^{(d)})$ with respect to $b_j(k)$ is nonzero only when $o_t^{(d)} = v_k$, which we express with the indicator function $I(o_t^{(d)}=v_k)$. This yields
$$b_j(k) = \frac{\sum\limits_{d=1}^D \sum\limits_{t=1}^T P(O_d,s_t^{(d)}=q_j\mid \bar \lambda)\, I(o_t^{(d)} = v_k)}{\sum\limits_{d=1}^D \sum\limits_{t=1}^T P(O_d,s_t^{(d)} = q_j\mid \bar \lambda)}$$
Substituting Eq. $(15)$ gives
$$b_{j}(k) = \frac{\sum\limits_{d=1}^D\sum\limits_{t=1, o_t^{(d)}=v_k}^{T}\gamma_t^{(d)}(j)}{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T}\gamma_t^{(d)}(j)} = \frac{ \sum\limits_{d=1}^D \frac{1}{P_d} \sum\limits_{t=1,\, o^{(d)}_t=v_k}^{T} \alpha_t^{(d)}(j)\,\beta_t^{(d)}(j) } { \sum\limits_{d=1}^D \frac{1}{P_d} \sum\limits_{t=1}^{T} \alpha_t^{(d)}(j)\, \beta_t^{(d)}(j) } \tag{23}$$
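Likewise, Eq. $(23)$ only needs the $\gamma_t^{(d)}(j)$ values and an indicator on the observed symbol; a sketch with illustrative names:

```python
import numpy as np

def update_B(obs_seqs, alphas, betas, M):
    """Eq. (23): the numerator keeps only the time steps with o_t^{(d)} = v_k."""
    N = alphas[0].shape[1]
    num = np.zeros((N, M))
    den = np.zeros(N)
    for O, alpha, beta in zip(obs_seqs, alphas, betas):
        P_d = alpha[-1].sum()
        gamma = alpha * beta / P_d        # (T, N) array of gamma_t^{(d)}(j)
        for t, o_t in enumerate(O):
            num[:, o_t] += gamma[t]       # indicator I(o_t^{(d)} = v_k)
        den += gamma.sum(axis=0)
    return num / den[:, None]
```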
In outline, the Baum-Welch algorithm proceeds as follows: initialize the model $\lambda=(A,B,\pi)$, compute new model parameters using Eqs. $(21)\sim(23)$, and iterate until a termination condition is satisfied.
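Putting the pieces together, below is one possible self-contained sketch of the whole procedure; the forward/backward recursions follow post Ⅲ, the stopping rule is simply a fixed number of iterations, and all function and variable names are assumptions of this sketch rather than part of the original derivation:

```python
import numpy as np

def forward(O, A, B, pi):
    """alpha[t, i] = P(o_1..o_{t+1}, s_{t+1} = q_i | lambda) (0-based row index t)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha

def backward(O, A, B):
    """beta[t, i] = P(o_{t+2}..o_T | s_{t+1} = q_i, lambda) (0-based row index t)."""
    T, N = len(O), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

def baum_welch(obs_seqs, N, M, n_iter=50, seed=0):
    """Iterate Eqs. (21)-(23) on D integer-encoded observation sequences."""
    rng = np.random.default_rng(seed)
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N)

    for _ in range(n_iter):
        pi_num = np.zeros(N)
        A_num, A_den = np.zeros((N, N)), np.zeros(N)
        B_num, B_den = np.zeros((N, M)), np.zeros(N)
        for O in obs_seqs:
            alpha, beta = forward(O, A, B, pi), backward(O, A, B)
            P_d = alpha[-1].sum()                 # P(O_d | lambda_bar)
            gamma = alpha * beta / P_d            # gamma_t^{(d)}(i)
            pi_num += gamma[0]                    # Eq. (21), numerator
            for t in range(len(O) - 1):           # Eq. (22)
                A_num += np.outer(alpha[t], B[:, O[t + 1]] * beta[t + 1]) * A / P_d
                A_den += gamma[t]
            for t, o_t in enumerate(O):           # Eq. (23)
                B_num[:, o_t] += gamma[t]
            B_den += gamma.sum(axis=0)
        pi = pi_num / len(obs_seqs)
        A = A_num / A_den[:, None]
        B = B_num / B_den[:, None]
    return A, B, pi
```

For long sequences the raw $\alpha$ and $\beta$ values underflow, so in practice the recursions are scaled or carried out in the log domain; this is a numerical rather than conceptual change.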