The hidden Markov model (HMM) is a statistical model for tagging problems; it is a generative model.
This chapter is based on Dr. Li Hang's 《统计学习方法》 (*Statistical Learning Methods*).
It adds derivations for several results that the book states without proof.
For example, consider three tagged sentences:
Sentence 1: 我/noun 看见/verb 猫/noun ("I see a cat")
Sentence 2: 猫/noun 是/verb 可爱的/adjective ("The cat is cute")
Sentence 3: 我/noun 是/verb 可爱的/adjective ("I am cute")
In general we can only observe the concrete words, so a word sequence such as "我 看见 猫 ..." is the observation sequence, while the part-of-speech tags ("noun verb adjective ...") form the hidden state sequence.
Let $Q$ be the set of all possible states and $V$ the set of all possible observations:
$$Q = \{q_1, q_2, \dots, q_N\}, \quad V = \{v_1, v_2, \dots, v_M\}$$
where $N$ is the number of possible states and $M$ is the number of possible observations.
For example: $Q = \{$noun, verb, adjective$\}$, $V = \{$我, 看见, 猫, 是, 可爱的$\}$, $N = 3$, $M = 5$.
$I$ is a state sequence of length $T$, and $O$ is the corresponding observation sequence:
$$I = (i_1, i_2, \dots, i_T), \quad O = (o_1, o_2, \dots, o_T)$$
For example: $I = ($noun, verb, noun$)$, $O = ($我, 看见, 猫$)$.
$A$ is the state transition probability matrix:
$$A = [a_{ij}]_{N \times N} \tag{1}$$
where
$$a_{ij} = p(i_{t+1} = q_j \mid i_t = q_i), \quad i = 1, 2, \dots, N; \; j = 1, 2, \dots, N \tag{2}$$
For example (counting the transitions in the three sentences; an adjective never transitions in the corpus, so its row is left uniform):

| state transition prob. | noun | verb | adjective |
|---|---|---|---|
| noun | 0 | 1 | 0 |
| verb | 1/3 | 0 | 2/3 |
| adjective | 1/3 | 1/3 | 1/3 |
$B$ is the observation probability matrix, also called the emission matrix:
$$B = [b_j(k)]_{N \times M} \tag{3}$$
where
$$b_j(k) = p(o_t = v_k \mid i_t = q_j), \quad k = 1, 2, \dots, M; \; j = 1, 2, \dots, N \tag{4}$$
For example (again by counting; each row must sum to 1):

| emission prob. | 我 | 看见 | 猫 | 是 | 可爱的 |
|---|---|---|---|---|---|
| noun | 1/2 | 0 | 1/2 | 0 | 0 |
| verb | 0 | 1/3 | 0 | 2/3 | 0 |
| adjective | 0 | 0 | 0 | 0 | 1 |
$\pi$ is the initial state probability vector:
$$\pi = (\pi_i) \tag{5}$$
where
$$\pi_i = p(i_1 = q_i), \quad i = 1, 2, \dots, N \tag{6}$$
$A$, $B$, and $\pi$ are the parameters of the HMM, written collectively as $\lambda$:
$$\lambda = (A, B, \pi) \tag{7}$$
For example (all three sentences start with a noun):

| noun | verb | adjective |
|---|---|---|
| 1 | 0 | 0 |
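
To make the running example concrete, here is the model above written out as numpy arrays (an illustrative snippet; the index order follows the tables above):

```python
import numpy as np

# States: 0 = noun, 1 = verb, 2 = adjective
# Observations: 0 = 我, 1 = 看见, 2 = 猫, 3 = 是, 4 = 可爱的
A = np.array([[0,   1,   0  ],       # transition matrix, shape (N, N)
              [1/3, 0,   2/3],
              [1/3, 1/3, 1/3]])
B = np.array([[1/2, 0,   1/2, 0,   0],   # emission matrix, shape (N, M)
              [0,   1/3, 0,   2/3, 0],
              [0,   0,   0,   0,   1]])
pi = np.array([1.0, 0.0, 0.0])       # initial distribution, shape (N,)

# Every row of A and B is a probability distribution.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
```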
The three basic problems of hidden Markov models:
1. Probability computation (evaluation). Given the model $\lambda=(A,B,\pi)$ and an observation sequence $O=(o_1,o_2,\dots,o_T)$, compute the probability of the observation sequence under the known model parameters, i.e. $p(O|\lambda)$.
2. Learning. Given an observation sequence $O=(o_1,o_2,\dots,o_T)$, estimate the model parameters $\lambda=(A,B,\pi)$ that maximize $p(O|\lambda)$.
3. Prediction, also called decoding. Given the model $\lambda=(A,B,\pi)$ and $O=(o_1,o_2,\dots,o_T)$, find the state sequence $I=(i_1,i_2,\dots,i_T)$ that maximizes the conditional probability $p(I|O)$.
The probability computation problem can be solved by direct enumeration, but the cost is prohibitive (exponential in $T$); the forward and backward algorithms use dynamic programming to reduce it to $O(N^2 T)$.
For notational convenience, write:
$$o_{1:t} = (o_1, o_2, \dots, o_t); \quad o_{t:T} = (o_t, o_{t+1}, \dots, o_T)$$
We first derive the forward probability $p(i_t, o_{1:t}|\lambda)$:
$$\begin{aligned} p(i_t, o_{1:t}|\lambda) &= \sum_{i_{t-1}} p(i_{t-1}, i_t, o_{1:t-1}, o_t|\lambda) \\ &= \sum_{i_{t-1}} p(o_t|i_{t-1}, i_t, o_{1:t-1}, \lambda)\, p(i_t|i_{t-1}, o_{1:t-1}, \lambda)\, p(i_{t-1}, o_{1:t-1}|\lambda) \end{aligned}$$
By the conditional independence assumptions of the HMM:
$$p(o_t|i_{t-1}, i_t, o_{1:t-1}, \lambda) = p(o_t|i_t, \lambda)$$
$$p(i_t|i_{t-1}, o_{1:t-1}, \lambda) = p(i_t|i_{t-1}, \lambda)$$
Hence
$$p(i_t, o_{1:t}|\lambda) = \sum_{i_{t-1}} p(o_t|i_t, \lambda)\, p(i_t|i_{t-1}, \lambda)\, p(i_{t-1}, o_{1:t-1}|\lambda) = \Big[\sum_{i_{t-1}} p(i_{t-1}, o_{1:t-1}|\lambda)\, p(i_t|i_{t-1}, \lambda)\Big] p(o_t|i_t, \lambda)$$
Define:
$$\alpha_{t+1}(i) = p(o_{1:t+1}, i_{t+1} = q_i|\lambda) \tag{8}$$
and note that:
$$p(i_{t+1} = q_i|i_t = q_j, \lambda) = a_{ji}$$
$$p(o_{t+1}|i_{t+1} = q_i, \lambda) = b_i(o_{t+1})$$
Then:
$$\alpha_{t+1}(i) = \Big[\sum_{j=1}^N \alpha_t(j)\, a_{ji}\Big] b_i(o_{t+1}) \tag{9}$$
so the forward algorithm can proceed iteratively.
Forward algorithm:
1. Initialization: $\alpha_1(i) = \pi_i b_i(o_1)$
2. Recursion: for $t = 1, 2, \dots, T-1$,
$$\alpha_{t+1}(i) = \Big[\sum_{j=1}^N \alpha_t(j)\, a_{ji}\Big] b_i(o_{t+1})$$
3. Termination:
$$p(O|\lambda) = \sum_{i=1}^N \alpha_T(i)$$
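
As a sanity check, here is a minimal plain-probability version of the three steps above (the implementation at the end of this post works in log space instead; `forward` and its argument names are only this sketch's conventions):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm, formula (9). Returns alpha (N, T) and p(O|lambda)."""
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, obs[0]]                       # 1. initialization
    for t in range(1, T):
        alpha[:, t] = (alpha[:, t-1] @ A) * B[:, obs[t]]  # 2. recursion
    return alpha, alpha[:, -1].sum()                      # 3. termination
```

On the example model, `forward(pi, A, B, [0, 1, 2])` (我 看见 猫) gives $p(O|\lambda) = 1/36$: the only path with nonzero probability is noun→verb→noun.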
The backward algorithm computes the backward probability $p(o_{t+1:T}|i_t, \lambda)$:
$$\begin{aligned} p(o_{t+1:T}|i_t, \lambda) &= \sum_{i_{t+1}} p(i_{t+1}, o_{t+1}, o_{t+2:T}|i_t, \lambda) \\ &= \sum_{i_{t+1}} p(o_{t+2:T}|i_{t+1}, i_t, o_{t+1}, \lambda)\, p(o_{t+1}|i_{t+1}, i_t, \lambda)\, p(i_{t+1}|i_t, \lambda) \end{aligned}$$
By the conditional independence assumptions of the HMM:
$$p(o_{t+2:T}|i_{t+1}, i_t, o_{t+1}, \lambda) = p(o_{t+2:T}|i_{t+1}, \lambda)$$
$$p(o_{t+1}|i_{t+1}, i_t, \lambda) = p(o_{t+1}|i_{t+1}, \lambda)$$
Define:
$$\beta_t(i) = p(o_{t+1:T}|i_t = q_i, \lambda) \tag{10}$$
and, as before:
$$p(i_{t+1} = q_j|i_t = q_i, \lambda) = a_{ij}$$
$$p(o_{t+1}|i_{t+1} = q_j, \lambda) = b_j(o_{t+1})$$
Then:
$$\beta_t(i) = \sum_{j=1}^N a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \tag{11}$$
Backward algorithm:
(1) Initialization: $\beta_T(i) = 1$
(2) For $t = T-1, T-2, \dots, 1$:
$$\beta_t(i) = \sum_{j=1}^N a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$
(3) Termination:
$$p(O|\lambda) = \sum_{i=1}^N \pi_i\, b_i(o_1)\, \beta_1(i)$$
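
And the matching backward sketch, under the same assumptions; both functions must return the same $p(O|\lambda)$:

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward algorithm, formula (11). Returns beta (N, T) and p(O|lambda)."""
    N, T = len(pi), len(obs)
    beta = np.zeros((N, T))
    beta[:, -1] = 1.0                                      # 1. beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[:, t] = A @ (B[:, obs[t+1]] * beta[:, t+1])   # 2. recursion
    return beta, (pi * B[:, obs[0]] * beta[:, 0]).sum()    # 3. termination
```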
The two expected values below are intermediate quantities used later by the EM algorithm.
1. The probability of being in state $q_i$ at time $t$.
The probability computation problem computes $p(O|\lambda)$, and we have:
$$p(O|\lambda) = \sum_{i_t} p(O, i_t|\lambda)$$
By the independence assumption of the HMM:
$$p(o_{t+1:T}|i_t, o_{1:t}, \lambda) = p(o_{t+1:T}|i_t, \lambda)$$
Therefore:
$$\begin{aligned} p(O|\lambda) &= \sum_{i_t} p(O, i_t|\lambda) \\ &= \sum_{i_t} p(o_{t+1:T}|i_t, o_{1:t}, \lambda)\, p(i_t, o_{1:t}|\lambda) \\ &= \sum_{i_t} p(o_{t+1:T}|i_t, \lambda)\, p(i_t, o_{1:t}|\lambda) \end{aligned}$$
Recalling:
$$\alpha_t(i) = p(o_{1:t}, i_t = q_i|\lambda) \tag{12}$$
$$\beta_t(i) = p(o_{t+1:T}|i_t = q_i, \lambda) \tag{13}$$
we get:
$$p(O, i_t = q_i|\lambda) = p(o_{t+1:T}|i_t = q_i, \lambda)\, p(i_t = q_i, o_{1:t}|\lambda) = \alpha_t(i)\, \beta_t(i)$$
$$p(O|\lambda) = \sum_{i=1}^N \alpha_t(i)\, \beta_t(i)$$
Define:
$$\gamma_t(i) = p(i_t = q_i|O, \lambda)$$
Then:
$$\gamma_t(i) = p(i_t = q_i|O, \lambda) = \frac{p(i_t = q_i, O|\lambda)}{p(O|\lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^N \alpha_t(j)\, \beta_t(j)} \tag{14}$$
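
Continuing the plain-probability sketches above, formula (14) is a one-liner once `alpha` and `beta` are available:

```python
import numpy as np

def posterior_states(alpha, beta):
    """Formula (14): gamma[i, t] = p(i_t = q_i | O, lambda).
    The column-wise normalizer is p(O|lambda) = sum_j alpha_t(j) beta_t(j)."""
    ab = alpha * beta
    return ab / ab.sum(axis=0, keepdims=True)
```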
2. The probability of being in state $q_i$ at time $t$ and in state $q_j$ at time $t+1$.
$$\begin{aligned} p(O|\lambda) &= \sum_{i_t} \sum_{i_{t+1}} p(O, i_t, i_{t+1}|\lambda) \\ &= \sum_{i_t} \sum_{i_{t+1}} p(o_{1:t}, o_{t+1}, o_{t+2:T}, i_t, i_{t+1}|\lambda) \\ &= \sum_{i_t} \sum_{i_{t+1}} p(o_{t+2:T}|o_{1:t}, o_{t+1}, i_t, i_{t+1}, \lambda)\, p(o_{t+1}|o_{1:t}, i_t, i_{t+1}, \lambda)\, p(i_{t+1}|i_t, o_{1:t}, \lambda)\, p(i_t, o_{1:t}|\lambda) \end{aligned}$$
By the independence assumptions of the HMM:
$$p(O|\lambda) = \sum_{i_t} \sum_{i_{t+1}} p(o_{t+2:T}|i_{t+1}, \lambda)\, p(o_{t+1}|i_{t+1}, \lambda)\, p(i_{t+1}|i_t, \lambda)\, p(i_t, o_{1:t}|\lambda)$$
Define:
$$\xi_t(i, j) = p(i_t = q_i, i_{t+1} = q_j|O, \lambda)$$
Then, using formulas (2), (4), (12), and (13),
we obtain:
$$\xi_t(i, j) = \frac{p(i_t = q_i, i_{t+1} = q_j, O|\lambda)}{p(O|\lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)} \tag{15}$$
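
Formula (15) in the same style (a sketch that reuses `alpha` and `beta` from the forward/backward sketches above):

```python
import numpy as np

def posterior_transitions(alpha, beta, A, B, obs):
    """Formula (15): xi[i, j, t] = p(i_t = q_i, i_{t+1} = q_j | O, lambda)."""
    N, T = alpha.shape
    xi = np.zeros((N, N, T - 1))
    for t in range(T - 1):
        # num[i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j)
        num = alpha[:, t, None] * A * (B[:, obs[t+1]] * beta[:, t+1])[None, :]
        xi[:, :, t] = num / num.sum()  # the double sum is p(O|lambda)
    return xi
```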
If we have samples whose state sequences are already labeled, learning is easy: estimate each entry of $\pi$, $A$, and $B$ by counting, taking care that each probability distribution sums to one, as in the sketch below.
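
A sketch of this supervised counting estimate (the function and argument names here are this sketch's own):

```python
import numpy as np

def supervised_mle(state_seqs, obs_seqs, N, M):
    """Estimate pi, A, B by counting over labeled (state, observation) pairs."""
    pi = np.zeros(N); A = np.zeros((N, N)); B = np.zeros((N, M))
    for states, obs in zip(state_seqs, obs_seqs):
        pi[states[0]] += 1
        for s, o in zip(states, obs):
            B[s, o] += 1
        for s, s_next in zip(states[:-1], states[1:]):
            A[s, s_next] += 1
    pi /= pi.sum()
    # A state with no outgoing transitions gets a uniform row,
    # matching the adjective row in the example transition table.
    row = A.sum(axis=1, keepdims=True)
    A = np.divide(A, row, out=np.full_like(A, 1.0 / N), where=row > 0)
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```

Applied to the three tagged sentences at the top of this post, this reproduces the example tables for $\pi$, $A$, and $B$.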
If no state-labeled samples are available, the parameters can be learned with the Baum-Welch algorithm (an instance of the EM algorithm).
Given: a set of $S$ observation sequences, each of length $T$: $\{O_1, O_2, \dots, O_S\}$.
Goal: learn the parameters $\lambda = (A, B, \pi)$ of the hidden Markov model.
Writing $O$ for the observed data and $I$ for the hidden data, the model can be expressed as:
$$p(O|\lambda) = \sum_I p(O|I, \lambda)\, p(I|\lambda)$$
E-step:
Since $1/p(O|\overline\lambda)$ is a constant with respect to $\lambda$, it can be dropped in the last step below:
$$\begin{aligned} Q(\lambda, \overline\lambda) &= E_I[\log p(O, I|\lambda)|O, \overline\lambda] \\ &= \sum_I \log p(O, I|\lambda)\, p(I|O, \overline\lambda) \\ &= \sum_I \log p(O, I|\lambda)\, \frac{p(I, O|\overline\lambda)}{p(O|\overline\lambda)} \\ &= \sum_I \log p(O, I|\lambda)\, p(I, O|\overline\lambda) \end{aligned}$$
Expanding $p(O, I|\lambda)$ with the same recursion used by the forward algorithm in the probability-computation section above gives:
$$p(O, I|\lambda) = \pi_{i_1} b_{i_1}(o_1)\, a_{i_1 i_2} b_{i_2}(o_2) \cdots a_{i_{T-1} i_T} b_{i_T}(o_T) = \pi_{i_1} \Big[\prod_{t=1}^{T-1} a_{i_t i_{t+1}}\Big] \Big[\prod_{t=1}^{T} b_{i_t}(o_t)\Big]$$
Taking the logarithm, the three groups of parameters separate:
$$Q(\lambda, \overline\lambda) = \sum_I \log \pi_{i_1}\, p(O, I|\overline\lambda) + \sum_I \Big(\sum_{t=1}^{T-1} \log a_{i_t i_{t+1}}\Big) p(O, I|\overline\lambda) + \sum_I \Big(\sum_{t=1}^{T} \log b_{i_t}(o_t)\Big) p(O, I|\overline\lambda) \tag{16}$$
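
The product form of $p(O, I|\lambda)$ above is easy to check numerically (a small sketch; `joint_prob` is a name invented here):

```python
import numpy as np

def joint_prob(pi, A, B, states, obs):
    """p(O, I | lambda) as the product pi_{i1} b_{i1}(o1) a_{i1 i2} b_{i2}(o2) ..."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t-1], states[t]] * B[states[t], obs[t]]
    return p
```

On the example model, `joint_prob(pi, A, B, [0, 1, 0], [0, 1, 2])` returns $1/36$, matching the forward-algorithm check earlier.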
A note on the hidden variables:
The hidden variables of an HMM are the state sequence underlying the observation sequence, so they can be represented by the quantity in formula (14).
The M-step below also uses formula (15); this does not mean there are two sets of hidden variables, it is only a matter of convenient notation and computation.
That is, representing the hidden variables with both $\gamma$ and $\xi$ in the E-step is purely for programming and notational convenience; the information carried by the two quantities in the E-step is redundant.
M-step:
1. Solving for $\pi_i$.
From (16), the term involving $\pi$ is:
$$L(\pi) = \sum_I \log \pi_{i_1}\, p(O, I|\overline\lambda) = \sum_{i=1}^N \log \pi_i\, p(O, i_1 = i|\overline\lambda)$$
Since $\pi_i$ satisfies the constraint $\sum_{i=1}^N \pi_i = 1$, we apply a Lagrange multiplier $\eta$ (a neutral symbol, to avoid clashing with $\gamma$ and $\lambda$) and write the Lagrangian:
$$\sum_{i=1}^N \log \pi_i\, p(O, i_1 = i|\overline\lambda) + \eta\Big(\sum_{i=1}^N \pi_i - 1\Big)$$
Taking the partial derivative and setting it to zero:
$$\frac{\partial}{\partial \pi_i} \Big[\sum_{i=1}^N \log \pi_i\, p(O, i_1 = i|\overline\lambda) + \eta\Big(\sum_{i=1}^N \pi_i - 1\Big)\Big] = 0 \tag{17}$$
gives:
$$p(O, i_1 = i|\overline\lambda) + \eta\, \pi_i = 0$$
so that:
$$\pi_i = \frac{p(O, i_1 = i|\overline\lambda)}{-\eta}$$
Substituting into $\sum_{i=1}^N \pi_i = 1$:
$$-\eta = \sum_{i=1}^N p(O, i_1 = i|\overline\lambda) = p(O|\overline\lambda)$$
Hence, using formula (14):
$$\pi_i = \frac{p(O, i_1 = i|\overline\lambda)}{p(O|\overline\lambda)} = \gamma_1(i) \tag{18}$$
2. Solving for $a_{ij}$.
$$L(a_{ij}) = \sum_I \Big(\sum_{t=1}^{T-1} \log a_{i_t i_{t+1}}\Big) p(O, I|\overline\lambda) = \sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t = i, i_{t+1} = j|\overline\lambda)$$
Applying the constraint $\sum_{j=1}^N a_{ij} = 1$ with a Lagrange multiplier, the Lagrangian is:
$$\sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t = i, i_{t+1} = j|\overline\lambda) + \eta\Big(\sum_{j=1}^N a_{ij} - 1\Big)$$
Taking the partial derivative and setting it to zero:
$$\frac{\partial}{\partial a_{ij}} \Big[\sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t = i, i_{t+1} = j|\overline\lambda) + \eta\Big(\sum_{j=1}^N a_{ij} - 1\Big)\Big] = 0$$
we obtain:
$$\sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline\lambda) + \eta\, a_{ij} = 0$$
so:
$$a_{ij} = \frac{\sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline\lambda)}{-\eta}$$
Substituting into $\sum_{j=1}^N a_{ij} = 1$:
$$-\eta = \sum_{j=1}^N \sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline\lambda) = \sum_{t=1}^{T-1} p(O, i_t = i|\overline\lambda)$$
Hence:
$$a_{ij} = \frac{\sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline\lambda)}{\sum_{t=1}^{T-1} p(O, i_t = i|\overline\lambda)} = \frac{\sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline\lambda)/p(O|\overline\lambda)}{\sum_{t=1}^{T-1} p(O, i_t = i|\overline\lambda)/p(O|\overline\lambda)}$$
Substituting (14) and (15) gives:
$$a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{19}$$
3. Solving for $b_j(k)$.
$$L(b_j(k)) = \sum_I \Big(\sum_{t=1}^{T} \log b_{i_t}(o_t)\Big) p(O, I|\overline\lambda) = \sum_{j=1}^N \sum_{t=1}^{T} \log b_j(o_t)\, p(O, i_t = j|\overline\lambda)$$
Under the constraint $\sum_{k=1}^M b_j(k) = 1$, the Lagrangian is:
$$\sum_{j=1}^N \sum_{t=1}^{T} \log b_j(o_t)\, p(O, i_t = j|\overline\lambda) + \eta\Big(\sum_{k=1}^M b_j(k) - 1\Big)$$
Taking the partial derivative and setting it to zero:
$$\frac{\partial}{\partial b_j(k)} \Big[\sum_{j=1}^N \sum_{t=1}^{T} \log b_j(o_t)\, p(O, i_t = j|\overline\lambda) + \eta\Big(\sum_{k=1}^M b_j(k) - 1\Big)\Big] = 0$$
The partial derivative is nonzero only at times where $o_t = v_k$; writing this with the indicator $I(o_t = v_k)$:
$$\sum_{t=1}^{T} p(O, i_t = j|\overline\lambda)\, I(o_t = v_k) + \eta\, b_j(o_t)\, I(o_t = v_k) = 0$$
Since $b_j(o_t)\, I(o_t = v_k)$ can be written as $b_j(k)$, we have:
$$b_j(k) = \frac{\sum_{t=1}^{T} p(O, i_t = j|\overline\lambda)\, I(o_t = v_k)}{-\eta}$$
Substituting into $\sum_{k=1}^M b_j(k) = 1$:
$$-\eta = \sum_{k=1}^M \sum_{t=1}^{T} p(O, i_t = j|\overline\lambda)\, I(o_t = v_k) = \sum_{t=1}^{T} p(O, i_t = j|\overline\lambda)$$
which gives:
$$b_j(k) = \frac{\sum_{t=1}^{T} p(O, i_t = j|\overline\lambda)\, I(o_t = v_k)}{\sum_{t=1}^{T} p(O, i_t = j|\overline\lambda)}$$
Using formula (14) again:
$$b_j(k) = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \tag{20}$$
Summary of the EM algorithm:
E-step:
$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^N \alpha_t(j)\, \beta_t(j)}, \qquad \xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$$
M-step:
$$\pi_i = \gamma_1(i), \qquad a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad b_j(k) = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
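
Putting the two steps together for a single sequence, still in plain probabilities (a sketch reusing the `forward`/`backward` functions above; the class below handles multiple sequences and works in log space):

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM iteration of formulas (14), (15), (18), (19), (20)."""
    alpha, p_O = forward(pi, A, B, obs)
    beta, _ = backward(pi, A, B, obs)
    N, T = alpha.shape
    gamma = alpha * beta / p_O                                          # (14)
    xi = np.zeros((N, N, T - 1))
    for t in range(T - 1):
        xi[:, :, t] = alpha[:, t, None] * A * \
            (B[:, obs[t+1]] * beta[:, t+1])[None, :] / p_O              # (15)
    new_pi = gamma[:, 0]                                                # (18)
    new_A = xi.sum(axis=2) / gamma[:, :-1].sum(axis=1, keepdims=True)   # (19)
    obs = np.asarray(obs)
    new_B = np.stack([gamma[:, obs == k].sum(axis=1)
                      for k in range(B.shape[1])], axis=1) \
        / gamma.sum(axis=1, keepdims=True)                              # (20)
    return new_pi, new_A, new_B
```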
The decoding problem is solved with the Viterbi algorithm.
Given: the model $\lambda = (A, B, \pi)$ and $O = (o_1, o_2, \dots, o_T)$.
Find: the state sequence $I = (i_1, i_2, \dots, i_T)$ maximizing the conditional probability $p(I|O, \lambda)$.
Since $p(O|\lambda)$ is a constant that does not depend on $I$, the maximizing $I$ is the same as for the joint probability:
$$\max_I p(I|O, \lambda) = \max_I \frac{p(I, O|\lambda)}{p(O|\lambda)}, \quad \text{so it suffices to solve } \max_I p(I, O|\lambda)$$
Define the maximum probability over all single paths $(i_1, i_2, \dots, i_t)$ that end in state $i$ at time $t$:
$$\delta_t(i) = \max_{i_1, i_2, \dots, i_{t-1}} p(i_t = i, i_{1:t-1}, o_{1:t}|\lambda)$$
Recursive derivation:
$$\begin{aligned} p(i_{t+1} = i, i_{1:t}, o_{1:t+1}|\lambda) &= p(i_{t+1} = i, i_t, i_{1:t-1}, o_{t+1}, o_{1:t}|\lambda) \\ &= p(o_{t+1}|i_{t+1} = i, i_t, i_{1:t-1}, o_{1:t}, \lambda)\, p(i_{t+1} = i|i_t, i_{1:t-1}, o_{1:t}, \lambda)\, p(i_t, i_{1:t-1}, o_{1:t}|\lambda) \\ &= p(o_{t+1}|i_{t+1} = i, \lambda)\, p(i_{t+1} = i|i_t, \lambda)\, p(i_t, i_{1:t-1}, o_{1:t}|\lambda) \end{aligned}$$
Hence:
$$\delta_{t+1}(i) = \max_{i_1, i_2, \dots, i_t} p(i_{t+1} = i, i_{1:t}, o_{1:t+1}|\lambda) = \max_{1 \le j \le N} [\delta_t(j)\, a_{ji}]\, b_i(o_{t+1}) \tag{21}$$
Define $\psi_t(i)$ as the state at time $t-1$ on the maximum-probability path that ends in state $i$ at time $t$:
$$\psi_t(i) = \arg\max_{1 \le j \le N} [\delta_{t-1}(j)\, a_{ji}] \tag{22}$$
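
A plain-probability Viterbi sketch of (21) and (22) (the `decode` method below does the same in log space):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most probable state path for obs, formulas (21)-(22)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((N, T))
    psi = np.zeros((N, T), dtype=int)
    delta[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[:, t-1, None] * A      # scores[j, i] = delta_{t-1}(j) a_{ji}
        psi[:, t] = scores.argmax(axis=0)     # (22)
        delta[:, t] = scores.max(axis=0) * B[:, obs[t]]  # (21)
    path = [int(delta[:, -1].argmax())]       # best final state
    for t in range(T - 1, 0, -1):             # backtrack through psi
        path.append(int(psi[path[-1], t]))
    return path[::-1]
```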
The correspondence between the symbols in the formulas above and the names used in the model below (the code keeps the same symbols):
:param N: $N$, the number of states
:param M: $M$, the number of observations
:param V: $V$, the observation set, shape $(M,)$
:param A: $A$, the state transition matrix, shape $(N, N)$
:param B: $B$, the observation probability (emission) matrix, shape $(N, M)$
:param pi: $\pi$, the initial state vector, shape $(N,)$
:param S: $S$, the number of input sentences
:param T: $T$, the length of each sentence
:param gamma: $\gamma$, hidden variable, state probabilities, shape $(S, N, T)$
:param xi: $\xi$, hidden variable, transition probabilities, shape $(S, N, N, T)$
:param alpha: $\alpha$, forward-algorithm result, shape $(N, T)$
:param beta: $\beta$, backward-algorithm result, shape $(N, T)$
:param delta: $\delta$, maximum path probabilities in the Viterbi algorithm, shape $(N, T)$
:param psi: $\psi$, argmax indices in the Viterbi algorithm, shape $(N, T)$
:param I: $I$, the output state sequence, shape $(T,)$
Tip: to avoid numerical underflow from long products of probabilities, do the computation in log space and exponentiate only at the end, using the log-sum-exp trick (see http://bayesjumping.net/log-sum-exp-trick/):
$$\log \sum_i \exp(x_i) = b + \log \sum_i \exp(x_i - b), \qquad \text{typically } b = \max_i x_i$$
```python
import numpy as np

def logSumExp(ns):
    # Shift by the maximum so no exponential can overflow;
    # underflowed terms contribute ~0 to the sum.
    m = np.max(ns)
    ds = ns - m
    sumOfExp = np.exp(ds).sum()
    return m + np.log(sumOfExp)
```
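
For instance, `logSumExp(np.array([-1000.0, -1000.0]))` returns about $-999.31$, whereas the naive `np.log(np.exp(-1000) + np.exp(-1000))` underflows to `-inf`.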
```python
import numpy as np


class MyHMM(object):
    def __init__(self, N=None, A=None, B=None, pi=None):
        """
        HMM model.

        The three basic problems of hidden Markov models:
        1. Probability computation: given the model lambda = (A, B, pi) and an
           observation sequence O = (o_1, ..., o_T), compute p(O|lambda);
           solved with the forward or backward algorithm.
        2. Learning: given observation sequences O, estimate the parameters
           lambda = (A, B, pi) that maximize p(O|lambda); solved with the
           Baum-Welch (EM) algorithm.
        3. Prediction (decoding): given lambda and O, find the state sequence
           I = (i_1, ..., i_T) maximizing p(I|O); solved with the Viterbi
           algorithm.

        Symbols (same as in the derivation above):
        :param N: number of states
        :param M: number of observations
        :param V: observation set, shape (M,)
        :param A: state transition matrix, shape (N, N)
        :param B: observation probability (emission) matrix, shape (N, M)
        :param pi: initial state vector, shape (N,)
        :param S: number of input sentences
        :param T: length of each sentence
        :param gamma: hidden variable, state log-probabilities, shape (S, N, T)
        :param xi: hidden variable, transition log-probabilities, shape (S, N, N, T)
        :param alpha: forward-algorithm result, shape (N, T)
        :param beta: backward-algorithm result, shape (N, T)
        :param delta: Viterbi maximum log-probabilities, shape (N, T)
        :param psi: Viterbi argmax indices, shape (N, T)
        :param I: output state sequence, shape (T,)
        """
        self.N = N  # number of states
        self.params = {
            'A': A,
            'B': B,
            'pi': pi,
            'gamma': None,
            'xi': None
        }
        self.M = None  # number of observations
        self.S = None  # number of sentences
        self.T = None  # length of each sentence
        self.V = None  # observation set
        self.eps = np.finfo(float).eps
        np.random.seed(2)
    def _init_params(self):
        """Randomly initialize the model parameters."""
        def generate_random_n_data(n):
            ret = np.random.rand(n)
            return ret / np.sum(ret)  # normalize to a probability distribution

        pi = generate_random_n_data(self.N)
        A = np.array([generate_random_n_data(self.N) for _ in range(self.N)])
        B = np.array([generate_random_n_data(self.M) for _ in range(self.N)])
        gamma = np.zeros((self.S, self.N, self.T))
        xi = np.zeros((self.S, self.N, self.N, self.T))
        self.params = {
            'A': A,
            'B': B,
            'pi': pi,
            'gamma': gamma,
            'xi': xi
        }

    def logSumExp(self, ns, axis=None):
        # Any shift keeps log-sum-exp exact; the global max avoids overflow.
        m = np.max(ns)
        ds = ns - m
        sumOfExp = np.exp(ds).sum(axis=axis)
        return m + np.log(sumOfExp)
    def _forward(self, O_s):
        """
        Forward algorithm in log space; see formula (9) above.
        :param O_s: a single observation sequence of length T
        :return: log_alpha, shape (N, T)
        """
        A = self.params['A']
        B = self.params['B']
        pi = self.params['pi']
        T = len(O_s)
        log_alpha = np.zeros((self.N, T))
        for i in range(self.N):
            log_alpha[i, 0] = np.log(pi[i] + self.eps) + np.log(B[i, O_s[0]] + self.eps)
        for t in range(1, T):
            for i in range(self.N):
                log_alpha[i, t] = self.logSumExp(np.array([log_alpha[_i, t-1] +
                                                           np.log(A[_i, i] + self.eps) +
                                                           np.log(B[i, O_s[t]] + self.eps)
                                                           for _i in range(self.N)]))
        return log_alpha

    def _backward(self, O_s):
        """
        Backward algorithm in log space; see formula (11) above.
        :param O_s: a single observation sequence of length T
        :return: log_beta, shape (N, T)
        """
        A = self.params['A']
        B = self.params['B']
        T = len(O_s)
        log_beta = np.zeros((self.N, T))
        for i in range(self.N):
            log_beta[i, T-1] = 0  # log beta_T(i) = log 1
        for t in range(T-2, -1, -1):
            for i in range(self.N):
                log_beta[i, t] = self.logSumExp(np.array([
                    log_beta[_i, t+1] + np.log(A[i, _i] + self.eps) + np.log(B[_i, O_s[t+1]] + self.eps)
                    for _i in range(self.N)]))
        return log_beta
    def _E_step(self, O):
        """
        E-step of the Baum-Welch algorithm: compute the hidden-variable
        posteriors gamma and xi in log space; see formulas (14) and (15).
        :param O: all observation sequences, shape (S, T)
        """
        A = self.params['A']
        B = self.params['B']
        pi = self.params['pi']
        # Process the S sentences one by one.
        for s in range(self.S):
            O_s = O[s]
            log_alpha = self._forward(O_s)
            log_beta = self._backward(O_s)
            # Log-likelihood from the forward pass: log p(O_s | lambda).
            log_likelihood = self.logSumExp(log_alpha[:, self.T - 1])
            # The backward pass would give the same value:
            # log_likelihood = self.logSumExp(np.array(
            #     [np.log(pi[_i] + self.eps) + np.log(B[_i, O_s[0]] + self.eps) + log_beta[_i, 0]
            #      for _i in range(self.N)]))
            for i in range(self.N):
                self.params['gamma'][s, i, self.T-1] = log_alpha[i, self.T-1] + log_beta[i, self.T-1] - log_likelihood
            for t in range(self.T - 1):
                for i in range(self.N):
                    self.params['gamma'][s, i, t] = log_alpha[i, t] + log_beta[i, t] - log_likelihood
                    for j in range(self.N):
                        self.params['xi'][s, i, j, t] = (log_alpha[i, t] + np.log(A[i, j] + self.eps) +
                                                         np.log(B[j, O_s[t + 1]] + self.eps) +
                                                         log_beta[j, t+1] - log_likelihood)
    def _M_step(self, O):
        """
        M-step of the Baum-Welch algorithm: re-estimate pi, A, and B;
        see formulas (18), (19), and (20).
        :param O: all observation sequences, shape (S, T)
        """
        gamma = self.params['gamma']
        xi = self.params['xi']
        count_gamma = np.zeros((self.S, self.N, self.M))
        count_xi = np.zeros((self.S, self.N, self.N))
        for s in range(self.S):
            O_s = O[s, :]
            for i in range(self.N):
                for k in range(self.M):
                    if not (O_s == k).any():
                        # Observation k never occurs in this sentence.
                        count_gamma[s, i, k] = np.log(self.eps)
                    else:
                        count_gamma[s, i, k] = self.logSumExp(gamma[s, i, O_s == k])
                for j in range(self.N):
                    # xi is only defined for t = 0 .. T-2; exclude the unused last slot.
                    count_xi[s, i, j] = self.logSumExp(xi[s, i, j, :-1])
        # Formula (18): average gamma_1(i) over the S sentences.
        self.params['pi'] = np.exp(self.logSumExp(gamma[:, :, 0], axis=0) - np.log(self.S + self.eps))
        np.testing.assert_almost_equal(self.params['pi'].sum(), 1)
        for i in range(self.N):
            for k in range(self.M):
                # Formula (20).
                self.params['B'][i, k] = np.exp(self.logSumExp(count_gamma[:, i, k]) -
                                                self.logSumExp(count_gamma[:, i, :]))
            for j in range(self.N):
                # Formula (19).
                self.params['A'][i, j] = np.exp(self.logSumExp(count_xi[:, i, j]) -
                                                self.logSumExp(count_xi[:, i, :]))
            np.testing.assert_almost_equal(self.params['A'][i, :].sum(), 1)
            np.testing.assert_almost_equal(self.params['B'][i, :].sum(), 1)
    def fit(self, O, V=(0, 1, 2, 3, 4), max_iter=20):
        O = np.array(O)
        self.S, self.T = O.shape
        self.M = len(V)
        self.V = V
        print(self.S, self.T)
        self._init_params()
        for i in range(max_iter):
            self._E_step(O)
            self._M_step(O)
    def decode(self, O_s):
        """
        Viterbi decoding; see formulas (21) and (22).
        :param O_s: a single observation sequence
        :return: the most probable state sequence I
        """
        O_s = np.array(O_s)
        if len(O_s.shape) != 1:
            raise ValueError('Only a single sequence can be decoded.')
        T = len(O_s)
        delta = np.zeros((self.N, T))
        psi = np.zeros((self.N, T))
        for i in range(self.N):
            psi[i, 0] = 0
            delta[i, 0] = np.log(self.params['pi'][i] + self.eps) + np.log(self.params['B'][i, O_s[0]] + self.eps)
        for t in range(1, T):
            for i in range(self.N):
                seq_prob = [delta[j, t-1] + np.log(self.params['A'][j, i] + self.eps) +
                            np.log(self.params['B'][i, O_s[t]] + self.eps)
                            for j in range(self.N)]
                delta[i, t] = np.max(seq_prob)   # formula (21)
                psi[i, t] = np.argmax(seq_prob)  # formula (22)
        # Backtrack from the most probable final state.
        pointer = int(np.argmax(delta[:, -1]))
        I = [pointer]
        for t in reversed(range(1, T)):
            pointer = int(psi[pointer, t])
            I.append(pointer)
        I.reverse()
        return I

def generate_data():
    O = [['我', '看见', '猫'],
         ['猫', '是', '可爱的'],
         ['我', '是', '可爱的']]
    word2index = {}
    index2word = {}
    for sentence in O:
        for word in sentence:
            if word not in word2index.keys():
                word2index[word] = len(word2index)
                index2word[len(index2word)] = word
    print(word2index)
    print(index2word)
    O_input = []
    for sentence in O:
        O_input.append([word2index[word] for word in sentence])
    print(O_input)
    return O_input

def run_my_model():
    O_input = generate_data()
    N = 3  # number of hidden states: 3 parts of speech
    my = MyHMM(N=N)
    my.fit(O_input)
    print('A:')
    print(my.params['A'])
    print('B:')
    print(my.params['B'])
    print('pi:')
    print(my.params['pi'])
    I = my.decode(O_s=(2, 1, 0))  # 猫 看见 我
    print("I:")
    print(I)


if __name__ == '__main__':
    run_my_model()
```
The printed results:

```
A:
[[0.33333528 0.33332093 0.33334378]
 [0.43652988 0.25000742 0.3134627 ]
 [0.2500279  0.4999721  0.25      ]]
B:
[[9.91856630e-17 1.12533719e-04 1.06467892e-01 3.70235380e-05 8.93382551e-01]
 [7.40229993e-17 3.33285967e-01 1.91726675e-07 6.66712274e-01 1.56719927e-06]
 [5.31681136e-01 1.48466699e-11 4.68318638e-01 1.95615241e-11 2.25912305e-07]]
pi:
[1.16786216e-17 3.50652002e-23 1.00000000e+00]
I:
[2, 1, 2]
```
References:
Li Hang, 《统计学习方法》 (*Statistical Learning Methods*).