[NLP] Hidden Markov Models [IV]: The Learning Problem

If you spot any writing, formatting, or conceptual errors, corrections are welcome.

Due to length limits, the series is split into six posts:
[NLP] Hidden Markov Models [I]: Markov Models
[NLP] Hidden Markov Models [II]: Overview of Hidden Markov Models
[NLP] Hidden Markov Models [III]: The Estimation Problem
[NLP] Hidden Markov Models [IV]: The Learning Problem
[NLP] Hidden Markov Models [V]: The Decoding Problem
[NLP] Hidden Markov Models [VI]: The Precision Problem

2.4. Learning Algorithms

Learning a hidden Markov model can be done by supervised learning when the training data include both the observation sequences and their corresponding state sequences, or by unsupervised learning when only the observation sequences are available. Supervised learning is simple and easy to implement, but it requires labeled training data, and manual annotation is often expensive, so unsupervised methods are sometimes used instead. Unsupervised learning for HMMs refers to the Baum-Welch algorithm, which is essentially an instance of the EM algorithm; when Baum and Welch proposed it, this family of iterative optimization algorithms had not yet been unified under the name EM.

2.4.1. Supervised Learning

Suppose the training data consist of $D$ observation sequences of equal length together with their corresponding state sequences, $\{(O_1,I_1),(O_2,I_2),\dots,(O_D,I_D)\}$. When the dataset is large enough, relative frequencies can be used directly as probability estimates.

Estimating the transition probabilities $a_{ij}$. Let $A_{ij}$ be the number of times in the sample that state $q_i$ at time $t$ is followed by state $q_j$ at time $t+1$. The estimate of the transition probability $a_{ij}$ is
$$\hat a_{ij} = \frac{A_{ij}}{\sum\limits_{j=1}^N A_{ij}}, \quad 1\le i,j\le N$$
Estimating the emission probabilities $b_j(k)$. Let $B_{jk}$ be the number of times in the sample that the state is $q_j$ and the observation is $v_k$. The estimate of the probability $b_j(k)$ of observing $v_k$ in state $q_j$ is
$$\hat b_j(k) = \frac{B_{jk}}{\sum\limits_{k=1}^M B_{jk}}, \quad 1\le k\le M;\ 1\le j\le N$$
Estimating the initial state probabilities $\pi_i$. Let $C_i$ be the number of sequences in the sample whose initial state is $q_i$. The estimate of the initial state probability $\pi_i$ is
$$\hat\pi_i = \frac{C_i}{\sum\limits_{j=1}^N C_j}$$
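As a minimal sketch, the three counting estimates above can be implemented as follows. This assumes states and observations are already integer-coded in $[0,N)$ and $[0,M)$; the function and argument names are illustrative, not from any library, and a practical version would smooth the counts to avoid zero denominators for unseen states.

```python
import numpy as np

def supervised_hmm_estimate(state_seqs, obs_seqs, N, M):
    """Estimate (A, B, pi) by counting from paired state/observation sequences."""
    A = np.zeros((N, N))   # A[i, j]: count of transitions q_i -> q_j
    B = np.zeros((N, M))   # B[j, k]: count of state q_j emitting v_k
    pi = np.zeros(N)       # pi[i]: count of sequences starting in q_i
    for states, obs in zip(state_seqs, obs_seqs):
        pi[states[0]] += 1
        for t in range(len(states) - 1):
            A[states[t], states[t + 1]] += 1
        for s, o in zip(states, obs):
            B[s, o] += 1
    # Normalize counts into probabilities (each row sums to 1).
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    pi /= pi.sum()
    return A, B, pi
```

For example, with two state sequences `[0, 1, 0]` and `[1, 1, 0]` and observations `[0, 1, 1]` and `[1, 0, 0]`, the estimates are exactly the normalized transition, emission, and initial-state counts.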

2.4.2. The Baum-Welch Algorithm

Now suppose the training data consist only of $D$ observation sequences of length $T$, $\{O_1,O_2,\dots,O_D\}$, with no corresponding state sequences, where $O_d=\{o_1^{(d)},o_2^{(d)},\dots,o_T^{(d)}\}$ and the (unobserved) state sequence is $S_d = \{ s_1^{(d)},s_2^{(d)},\dots,s_T^{(d)} \}$. The goal is to use these $D$ sequences to learn the parameters of a single hidden Markov model $\lambda = (A,B,\pi)$.

By Eq. (3), the joint probability $P(O,S\mid\lambda)$ of the $D$ independent observation sequences is
$$P(O,S\mid \lambda) = \prod_{d=1}^D \left(\pi_{s_1^{(d)}}\prod_{t=1}^T b_{s_t^{(d)}}(o_t^{(d)})\prod_{t=1}^{T-1} a_{s_{t}^{(d)} s_{t+1}^{(d)}} \right) \tag{19}$$
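Eq. (19) can be sketched directly in code; working in log space avoids underflow for long sequences. The helper name `joint_log_prob` is illustrative, and states/observations are assumed integer-coded.

```python
import numpy as np

def joint_log_prob(A, B, pi, state_seqs, obs_seqs):
    """log P(O, S | lambda) for D independent sequences, following Eq. (19):
    each sequence contributes pi_{s_1} * prod_t b_{s_t}(o_t) * prod_t a_{s_t, s_{t+1}}."""
    total = 0.0
    for states, obs in zip(state_seqs, obs_seqs):
        total += np.log(pi[states[0]])                  # initial-state factor
        for t in range(len(states)):
            total += np.log(B[states[t], obs[t]])       # emission factors
        for t in range(len(states) - 1):
            total += np.log(A[states[t], states[t + 1]])  # transition factors
    return total
```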
The E step of the EM algorithm computes the $Q$ function $Q(\lambda, \bar\lambda)$:
$$Q(\lambda,\bar \lambda) = \sum_{S} P(S\mid O,\bar \lambda) \log P(O,S\mid \lambda)$$
where $\bar\lambda$ is the current estimate of the hidden Markov model's parameters, and $\lambda$ is the parameter to be maximized over.

The M step maximizes $Q(\lambda, \bar\lambda)$ to obtain the model parameters $A$, $B$, $\pi$:
$$\bar \lambda = \arg\max_{\lambda} \sum_{S}P(S\mid O,\bar \lambda) \log P(O,S\mid \lambda)$$
Since $P(S\mid O,\bar \lambda) = P(O,S\mid \bar \lambda)/P(O\mid \bar \lambda)$, where $P(O\mid \bar \lambda)$ is a constant with respect to $\lambda$, the maximization is equivalent to
$$\bar \lambda = \arg\max_\lambda \sum_{S} P(O,S\mid \bar \lambda)\log P(O,S\mid \lambda)$$
Substituting Eq. (19) gives
$$\begin{aligned} \bar{\lambda} &= \arg\max_{\lambda}\sum_{d=1}^D\sum_{S}P(O,S\mid \bar{\lambda})\left(\log\pi_{s_1^{(d)}} + \sum_{t=1}^{T-1}\log a_{s_t^{(d)} s_{t+1}^{(d)}} + \sum_{t=1}^T\log b_{s_t^{(d)}}(o_t^{(d)})\right) \\ &= \arg\max_{\lambda}\left[\sum_{d=1}^D\sum_{S}P(O,S\mid \bar{\lambda})\log\pi_{s_1^{(d)}} + \sum_{d=1}^D\sum_{S} \left(\sum_{t=1}^{T-1}\log a_{s_t^{(d)} s_{t+1}^{(d)}}\right) P(O,S\mid \bar{\lambda}) + \sum_{d=1}^D\sum_{S}\left(\sum_{t=1}^T\log b_{s_t^{(d)}}(o_t^{(d)})\right)P(O,S\mid \bar{\lambda})\right] \end{aligned} \tag{20}$$
Since the parameters $A$, $B$, and $\pi$ each appear in only one of the three terms of Eq. (20), each term can be maximized separately.

( 20 ) (20) (20) 的第一项可以写成:
∑ d = 1 D ∑ I P ( O , S ∣ λ ˉ ) log ⁡ π s 1 ( d ) = ∑ d = 1 D ∑ i = 1 N P ( O , s 1 = q i ∣ λ ˉ ) log ⁡ π i \sum_{d=1}^D\sum_{I} P(O,S\mid \bar \lambda) \log \pi_{s_1^{(d)}} = \sum_{d=1}^D \sum_{i=1}^N P(O,s_1=q_i\mid \bar \lambda) \log \pi_i d=1DIP(O,Sλˉ)logπs1(d)=d=1Di=1NP(O,s1=qiλˉ)logπi
Noting that $\pi_i$ satisfies the constraint $\sum\limits_{i=1}^N \pi_i = 1$, we apply the method of Lagrange multipliers and form the Lagrangian:
$$\sum_{d=1}^D \sum_{i=1}^N P(O,s_1^{(d)}=q_i\mid \bar \lambda) \log \pi_i + \mu \left( \sum_{i=1}^N \pi_i - 1 \right)$$
Taking the partial derivative with respect to $\pi_i$ and setting it to zero,
$$\frac{\partial }{\partial \pi_i}\left[ \sum_{d=1}^D \sum_{i=1}^N P(O,s_1^{(d)}=q_i\mid \bar \lambda) \log \pi_i + \mu \left( \sum_{i=1}^N \pi_i - 1 \right) \right] = 0$$

$$\sum\limits_{d=1}^D P(O,s_1^{(d)} =q_i\mid \bar{\lambda}) + \mu\pi_i = 0$$
i i i 求和得到 μ \mu μ
μ = − ∑ d = 1 D P ( O ∣ λ ˉ ) \mu = -\sum_{d=1}^D P(O\mid \bar \lambda) μ=d=1DP(Oλˉ)
Substituting this back into the zero-derivative equation to eliminate $\mu$ gives $\pi_i$:
$$\pi_i =\frac{\sum\limits_{d=1}^D P(O,s_1^{(d)} =q_i\mid\bar{\lambda})}{\sum\limits_{d=1}^D P(O\mid\bar{\lambda})} = \frac{\sum\limits_{d=1}^D P(O,s_1^{(d)} =q_i\mid\bar{\lambda})}{D\,P(O\mid\bar{\lambda})} = \frac{\sum\limits_{d=1}^D P(s_1^{(d)} =q_i\mid O, \bar{\lambda})}{D} = \frac{\sum\limits_{d=1}^D P(s_1^{(d)} =q_i\mid O_{d}, \bar{\lambda})}{D}$$
where the last equality holds because the sequences are independent, so $s_1^{(d)}$ depends only on $O_d$.
By Eq. (15), $\gamma_1^{(d)}(i) = P(s_1^{(d)}=q_i\mid O_{d},\bar \lambda)$, so the above becomes
$$\pi_i = \frac{\sum\limits_{d=1}^D \gamma_1^{(d)}(i)}{D}=\frac{1}{D}\sum_{d=1}^D\frac{\alpha_1^{(d)}(i)\, \beta^{(d)}_1(i)}{P_d} \tag{21}$$
In principle, $P_d=P(O_d\mid \lambda)$; here, however, it is evaluated at the current estimate: $P_d=P(O_d\mid \bar \lambda)$.

( 20 ) (20) (20) 中的第二项可以写成:
∑ d = 1 D ∑ S ( ∑ t = 1 T − 1 log ⁡ a s t ( d ) , s t + 1 ( d ) ) P ( O , S ∣ λ ˉ ) = ∑ d = 1 D ∑ S ∑ t = 1 T − 1 P ( O , S ∣ λ ˉ ) log ⁡ a s t ( d ) , s t + 1 ( d ) = ∑ d = 1 D ∑ i = 1 N ∑ j = 1 N ∑ t = 1 T − 1 P ( O , s t ( d ) = q i , s t + 1 ( d ) = q j ∣ λ ˉ ) log ⁡ a i j \sum\limits_{d=1}^D\sum\limits_{S} \left(\sum\limits_{t=1}^{T-1}\log a_{s_t^{(d)},s_{t+1}^{(d)}}\right) P(O,S\mid \bar{\lambda}) = \sum\limits_{d=1}^D\sum\limits_{S}\sum\limits_{t=1}^{T-1}P(O,S\mid \bar{\lambda})\log a_{s_t^{(d)},s_{t+1}^{(d)}} = \sum\limits_{d=1}^D\sum\limits_{i=1}^N\sum\limits_{j=1}^N\sum\limits_{t=1}^{T-1}P(O,s_t^{(d)} = q_i, s_{t+1}^{(d)} = q_j\mid\bar{\lambda})\log a_{ij} d=1DS(t=1T1logast(d),st+1(d))P(O,Sλˉ)=d=1DSt=1T1P(O,Sλˉ)logast(d),st+1(d)=d=1Di=1Nj=1Nt=1T1P(O,st(d)=qi,st+1(d)=qjλˉ)logaij
Since $a_{ij}$ satisfies the constraint $\sum\limits_{j=1}^N a_{ij} = 1$, the same Lagrange-multiplier argument used for $\pi_i$ (take the partial derivative with respect to $a_{ij}$ and set it to zero) yields
$$a_{ij} = \frac{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T-1}P(O_{d}, s_t^{(d)} = q_i, s_{t+1}^{(d)} = q_j\mid\bar{\lambda})}{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T-1}P(O_d, s_t^{(d)} = q_i\mid \bar{\lambda})}$$
Using Eqs. (15) and (17), this becomes
$$a_{ij} = \frac{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T-1}\xi_t^{(d)}(i,j)}{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T-1}\gamma_t^{(d)}(i)} = \frac{\sum\limits_{d=1}^D\frac{1}{P_d}\sum\limits_{t=1}^{T-1} \alpha^{(d)}_t(i)\,a_{ij}\,b_j(o_{t+1}^{(d)})\,\beta^{(d)}_{t+1}(j) }{\sum\limits_{d=1}^D\frac{1}{P_d}\sum\limits_{t=1}^{T-1} \alpha^{(d)}_t(i)\, \beta^{(d)}_t(i) }\tag{22}$$
( 20 ) (20) (20) 中的第三项可以写成:
∑ d = 1 D ∑ S ( ∑ t = 1 T log ⁡ b s t ( d ) ( o t ( d ) ) ) P ( O , S ∣ λ ˉ ) = ∑ d = 1 D ∑ S ∑ t = 1 T P ( O , S ∣ λ ˉ ) log ⁡ b s t ( d ) ( o t ( d ) ) = ∑ d = 1 D ∑ j = 1 N ∑ t = 1 T P ( O , s t ( d ) = q j ∣ λ ˉ ) log ⁡ b j ( o t ( d ) ) \sum\limits_{d=1}^D\sum\limits_{S}\left(\sum\limits_{t=1}^T\log b_{s_t^{(d)}}(o_t^{(d)})\right)P(O,S\mid \bar{\lambda}) = \sum\limits_{d=1}^D\sum\limits_{S}\sum\limits_{t=1}^{T}P(O,S\mid\bar{\lambda})\log b_{s_t^{(d)}}(o_t^{(d)}) = \sum\limits_{d=1}^D\sum\limits_{j=1}^N\sum\limits_{t=1}^{T}P(O,s_t^{(d)} = q_j\mid \bar{\lambda})\log b_{j}(o_t^{(d)}) d=1DS(t=1Tlogbst(d)(ot(d)))P(O,Sλˉ)=d=1DSt=1TP(O,Sλˉ)logbst(d)(ot(d))=d=1Dj=1Nt=1TP(O,st(d)=qjλˉ)logbj(ot(d))
Since $b_j(k)$ satisfies the constraint $\sum\limits_{k=1}^M b_j(k)=1$, we again apply the Lagrange-multiplier method, taking the partial derivative with respect to $b_j(k)$ and setting it to zero. Note that the partial derivative of $b_j(o_t^{(d)})$ with respect to $b_j(k)$ is nonzero only when $o_t^{(d)} = v_k$, which we indicate with $I(o_t^{(d)}=v_k)$. This gives
$$b_j(k) = \frac{\sum\limits_{d=1}^D \sum\limits_{t=1}^T P(O_d,s_t^{(d)}=q_j\mid \bar \lambda)\, I(o_t^{(d)} = v_k)}{\sum\limits_{d=1}^D \sum\limits_{t=1}^T P(O_d,s_t^{(d)} = q_j\mid \bar \lambda)}$$
Substituting Eq. (15) gives
$$b_{j}(k) = \frac{\sum\limits_{d=1}^D\sum\limits_{t=1,\, o_t^{(d)}=v_k}^{T}\gamma_t^{(d)}(j)}{\sum\limits_{d=1}^D\sum\limits_{t=1}^{T}\gamma_t^{(d)}(j)} = \frac{ \sum\limits_{d=1}^D \frac{1}{P_d} \sum\limits_{t=1,\, o^{(d)}_t=v_k}^{T} \alpha_t^{(d)}(j)\,\beta_t^{(d)}(j) } { \sum\limits_{d=1}^D \frac{1}{P_d} \sum\limits_{t=1}^{T} \alpha_t^{(d)}(j)\, \beta_t^{(d)}(j) } \tag{23}$$
In outline, the Baum-Welch algorithm proceeds as follows: initialize the model $\lambda=(A,B,\pi)$, compute new parameters with Eqs. (21)–(23), and iterate until a stopping criterion is met.
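The update step above can be sketched in NumPy as follows. This assumes discrete observations and uses the unscaled forward-backward recursions from the estimation post; the function names are illustrative, and a practical implementation would scale $\alpha$ and $\beta$ (or work in log space) to avoid underflow on long sequences.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Unscaled forward/backward variables and P(O_d | lambda) for one sequence."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[T - 1].sum()   # P_d = P(O_d | lambda)

def baum_welch_step(A, B, pi, obs_seqs):
    """One EM iteration: re-estimate (A, B, pi) from D observation
    sequences via Eqs. (21)-(23)."""
    N, M = B.shape
    pi_num = np.zeros(N)
    A_num, A_den = np.zeros((N, N)), np.zeros(N)
    B_num, B_den = np.zeros((N, M)), np.zeros(N)
    for obs in obs_seqs:
        alpha, beta, P = forward_backward(A, B, pi, obs)
        gamma = alpha * beta / P                     # gamma_t(j), Eq. (15)
        T = len(obs)
        pi_num += gamma[0]                           # numerator of Eq. (21)
        for t in range(T - 1):                       # xi_t(i, j), Eq. (17)
            A_num += alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / P
        A_den += gamma[:-1].sum(axis=0)              # denominator of Eq. (22)
        for t in range(T):                           # numerator of Eq. (23)
            B_num[:, obs[t]] += gamma[t]
        B_den += gamma.sum(axis=0)                   # denominator of Eq. (23)
    return A_num / A_den[:, None], B_num / B_den[:, None], pi_num / len(obs_seqs)
```

Calling `baum_welch_step` repeatedly, stopping when the parameters (or the likelihood) change by less than a tolerance, gives the full algorithm; each iteration leaves $A$, $B$, and $\pi$ row-stochastic by construction.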
