Machine Learning: Hidden Markov Model (HMM) Principles and a Python Implementation

HMM

The hidden Markov model (HMM) is a statistical model that can be used for tagging problems; it is a generative model.

This chapter follows Dr. Hang Li's 《统计学习方法》 (Statistical Learning Methods).
It adds derivations for some of the results that are only stated there.

1. Starting from a Natural Language Processing Example

For example, take three part-of-speech-tagged sentences (word/tag):
Sentence 1: 我/名词 看见/动词 猫/名词 ("I/noun see/verb cat/noun")
Sentence 2: 猫/名词 是/动词 可爱的/形容词 ("cat/noun is/verb cute/adjective")
Sentence 3: 我/名词 是/动词 可爱的/形容词 ("I/noun is/verb cute/adjective")
In general we can only observe the concrete words, so a word sequence like "我 看见 猫 ..." is an observation sequence, while the part-of-speech tags "名词 动词 形容词 ..." form the hidden state sequence.

$Q$ is the set of all possible states and $V$ is the set of all possible observations:

$$Q = \{q_1, q_2, ..., q_N\}, \quad V = \{v_1, v_2, ..., v_M\}$$

where $N$ is the number of possible states and $M$ is the number of possible observations.

For example: $Q = \{名词, 动词, 形容词\}$, $V = \{我, 看见, 猫, 是, 可爱的\}$, $N = 3$, $M = 5$.

$I$ is a state sequence of length $T$, and $O$ is the corresponding observation sequence:

$$I = (i_1, i_2, ..., i_T), \quad O = (o_1, o_2, ..., o_T)$$

For example: $I = (名词, 动词, 名词)$, $O = (我, 看见, 猫)$.

$A$ is the state transition probability matrix:

$$A = [a_{ij}]_{N \times N} \tag{1}$$

where

$$a_{ij} = p(i_{t+1} = q_j \mid i_t = q_i), \quad i = 1, 2, ..., N;\ j = 1, 2, ..., N \tag{2}$$

For example, counting transitions in the three sentences above (the 形容词 row has no outgoing transitions, so it is left uniform):

Transition probability    名词    动词    形容词
名词                      0       1       0
动词                      1/3     0       2/3
形容词                    1/3     1/3     1/3

$B$ is the observation probability matrix, also called the emission matrix:

$$B = [b_j(k)]_{N \times M} \tag{3}$$

where

$$b_j(k) = p(o_t = v_k \mid i_t = q_j), \quad k = 1, 2, ..., M;\ j = 1, 2, ..., N \tag{4}$$

For example, counting emissions in the three sentences (each row sums to one):

Emission probability    我      看见    猫      是      可爱的
名词                    1/2     0       1/2     0       0
动词                    0       1/3     0       2/3     0
形容词                  0       0       0       0       1

$\pi$ is the initial state probability vector:

$$\pi = (\pi_i) \tag{5}$$

where

$$\pi_i = p(i_1 = q_i), \quad i = 1, 2, ..., N \tag{6}$$

$A$, $B$ and $\pi$ are the parameters of the HMM, collectively written $\lambda$:

$$\lambda = (A, B, \pi) \tag{7}$$

For example, the initial state probabilities estimated from the three sentences (each sentence starts with 名词):

名词    动词    形容词
1       0       0
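
To make the running example concrete, the three tables above can be written as NumPy arrays. This is just a sketch: the state order 名词(0), 动词(1), 形容词(2) and the observation order 我(0), 看见(1), 猫(2), 是(3), 可爱的(4) are conventions assumed here and reused by the sketches below.

import numpy as np

# Example parameters lambda = (A, B, pi), taken from the tables above.
A = np.array([[0.0, 1.0, 0.0],           # transitions out of 名词
              [1/3, 0.0, 2/3],           # transitions out of 动词
              [1/3, 1/3, 1/3]])          # transitions out of 形容词 (no data, uniform)
B = np.array([[1/2, 0.0, 1/2, 0.0, 0.0],    # 名词 emits 我 or 猫
              [0.0, 1/3, 0.0, 2/3, 0.0],    # 动词 emits 看见 or 是
              [0.0, 0.0, 0.0, 0.0, 1.0]])   # 形容词 always emits 可爱的
pi = np.array([1.0, 0.0, 0.0])  # every example sentence starts with 名词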

The three basic problems of hidden Markov models:
1. Probability computation. Given the model $\lambda = (A, B, \pi)$ and an observation sequence $O = (o_1, o_2, ..., o_T)$, compute the probability of the observation sequence under the model, i.e. $p(O|\lambda)$.
2. Learning. Given an observation sequence $O = (o_1, o_2, ..., o_T)$, estimate the model parameters $\lambda = (A, B, \pi)$ that maximize $p(O|\lambda)$.
3. Prediction, also called decoding. Given the model $\lambda = (A, B, \pi)$ and $O = (o_1, o_2, ..., o_T)$, find the state sequence $I = (i_1, i_2, ..., i_T)$ that maximizes the conditional probability $p(I|O)$.

2. Probability Computation Problem

Solving the probability computation problem by direct enumeration is computationally expensive; the forward and backward algorithms, both forms of dynamic programming, reduce the cost.
For notational convenience, write:

$$o_{1:t} = (o_1, o_2, ..., o_t), \quad o_{t:T} = (o_t, o_{t+1}, ..., o_T)$$

2.1 Forward Algorithm

The forward algorithm computes the forward probability $p(i_t, o_{1:t}|\lambda)$:

$$\begin{aligned} p(i_t, o_{1:t}|\lambda) &= \sum_{i_{t-1}} p(i_{t-1}, i_t, o_{1:t-1}, o_t|\lambda) \\ &= \sum_{i_{t-1}} p(o_t|i_{t-1}, i_t, o_{1:t-1}, \lambda)\, p(i_t|i_{t-1}, o_{1:t-1}, \lambda)\, p(i_{t-1}, o_{1:t-1}|\lambda) \end{aligned}$$

By the conditional independence assumptions of the hidden Markov model:

$$p(o_t|i_{t-1}, i_t, o_{1:t-1}, \lambda) = p(o_t|i_t, \lambda)$$

$$p(i_t|i_{t-1}, o_{1:t-1}, \lambda) = p(i_t|i_{t-1}, \lambda)$$

so that:

$$p(i_t, o_{1:t}|\lambda) = \sum_{i_{t-1}} p(o_t|i_t, \lambda)\, p(i_t|i_{t-1}, \lambda)\, p(i_{t-1}, o_{1:t-1}|\lambda) = \Big[\sum_{i_{t-1}} p(i_{t-1}, o_{1:t-1}|\lambda)\, p(i_t|i_{t-1}, \lambda)\Big] p(o_t|i_t, \lambda)$$

Define:

$$\alpha_{t+1}(i) = p(o_{1:t+1}, i_{t+1} = q_i|\lambda) \tag{8}$$

and note that:

$$p(i_{t+1} = q_i|i_t = q_j, \lambda) = a_{ji}$$

$$p(o_{t+1}|i_{t+1} = q_i, \lambda) = b_i(o_{t+1})$$

Then:

$$\alpha_{t+1}(i) = \Big[\sum_{j=1}^N \alpha_t(j)\, a_{ji}\Big] b_i(o_{t+1}) \tag{9}$$

so the forward probabilities can be computed iteratively.

The forward algorithm:
1. Initialization:

$$\alpha_1(i) = \pi_i\, b_i(o_1)$$

2. Recursion, for $t = 1, 2, ..., T-1$:

$$\alpha_{t+1}(i) = \Big[\sum_{j=1}^N \alpha_t(j)\, a_{ji}\Big] b_i(o_{t+1})$$

3. Termination:

$$p(O|\lambda) = \sum_{i=1}^N \alpha_T(i)$$
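
As a quick illustration of steps 1-3, here is a minimal sketch of the forward algorithm in plain probability space, assuming the A, B, pi arrays defined in section 1 (the class in section 5 does the same computation in log space):

def forward(O, A, B, pi):
    # alpha[t, i] = p(o_1..o_t, i_t = q_i | lambda), built by recursion (9).
    T, N = len(O), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                    # initialization: alpha_1(i) = pi_i b_i(o_1)
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, O[t]]  # recursion (9), vectorized over i
    return alpha[-1].sum()                        # termination: p(O|lambda) = sum_i alpha_T(i)

# For the sequence 我(0) 看见(1) 猫(2), the example parameters give 1/36:
# print(forward([0, 1, 2], A, B, pi))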

2.2 Backward Algorithm

The backward algorithm computes the backward probability $p(o_{t+1:T}|i_t, \lambda)$:

$$\begin{aligned} p(o_{t+1:T}|i_t, \lambda) &= \sum_{i_{t+1}} p(i_{t+1}, o_{t+1}, o_{t+2:T}|i_t, \lambda) \\ &= \sum_{i_{t+1}} p(o_{t+2:T}|i_{t+1}, i_t, o_{t+1}, \lambda)\, p(o_{t+1}|i_{t+1}, i_t, \lambda)\, p(i_{t+1}|i_t, \lambda) \end{aligned}$$

By the conditional independence assumptions of the hidden Markov model:

$$p(o_{t+2:T}|i_{t+1}, i_t, o_{t+1}, \lambda) = p(o_{t+2:T}|i_{t+1}, \lambda)$$

$$p(o_{t+1}|i_{t+1}, i_t, \lambda) = p(o_{t+1}|i_{t+1}, \lambda)$$

Define:

$$\beta_t(i) = p(o_{t+1:T}|i_t = q_i, \lambda) \tag{10}$$

and recall:

$$p(i_{t+1} = q_j|i_t = q_i, \lambda) = a_{ij}$$

$$p(o_{t+1}|i_{t+1} = q_j, \lambda) = b_j(o_{t+1})$$

Then:

$$\beta_t(i) = \sum_{j=1}^N a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \tag{11}$$

The backward algorithm:
(1) Initialization:

$$\beta_T(i) = 1$$

(2) Recursion, for $t = T-1, T-2, ..., 1$:

$$\beta_t(i) = \sum_{j=1}^N a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

(3) Termination:

$$p(O|\lambda) = \sum_{i=1}^N \pi_i\, b_i(o_1)\, \beta_1(i)$$
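
The matching backward sketch for (10)-(11), under the same assumptions as the forward sketch; it must return the same likelihood:

def backward(O, A, B, pi):
    # beta[t, i] = p(o_{t+1}..o_T | i_t = q_i, lambda), built by recursion (11).
    T, N = len(O), A.shape[0]
    beta = np.zeros((T, N))
    beta[T-1] = 1.0                               # initialization: beta_T(i) = 1
    for t in range(T-2, -1, -1):
        beta[t] = A @ (B[:, O[t+1]] * beta[t+1])  # recursion (11), vectorized over i
    return (pi * B[:, O[0]] * beta[0]).sum()      # termination: p(O|lambda)

# backward([0, 1, 2], A, B, pi) agrees with forward([0, 1, 2], A, B, pi).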

2.3 Some Probabilities and Expectations

Both quantities derived here are intermediate variables used later by the EM algorithm.
1. The probability of being in state $q_i$ at time $t$.
The probability computation problem computes $p(O|\lambda)$, so:

$$p(O|\lambda) = \sum_{i_t} p(O, i_t|\lambda)$$

By the independence assumption of the hidden Markov model:

$$p(o_{t+1:T}|i_t, o_{1:t}, \lambda) = p(o_{t+1:T}|i_t, \lambda)$$

so:

$$\begin{aligned} p(O|\lambda) &= \sum_{i_t} p(O, i_t|\lambda) \\ &= \sum_{i_t} p(o_{t+1:T}|i_t, o_{1:t}, \lambda)\, p(i_t, o_{1:t}|\lambda) \\ &= \sum_{i_t} p(o_{t+1:T}|i_t, \lambda)\, p(i_t, o_{1:t}|\lambda) \end{aligned}$$

Recalling the definitions:

$$\alpha_t(i) = p(o_{1:t}, i_t = q_i|\lambda) \tag{12}$$

$$\beta_t(i) = p(o_{t+1:T}|i_t = q_i, \lambda) \tag{13}$$

we get:

$$p(O, i_t = q_i|\lambda) = p(o_{t+1:T}|i_t = q_i, \lambda)\, p(i_t = q_i, o_{1:t}|\lambda) = \alpha_t(i)\, \beta_t(i)$$

$$p(O|\lambda) = \sum_{i=1}^N \alpha_t(i)\, \beta_t(i)$$

Define:

$$\gamma_t(i) = p(i_t = q_i|O, \lambda)$$

Then:

$$\gamma_t(i) = p(i_t = q_i|O, \lambda) = \frac{p(i_t = q_i, O|\lambda)}{p(O|\lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^N \alpha_t(j)\, \beta_t(j)} \tag{14}$$

2. The probability of being in state $q_i$ at time $t$ and in state $q_j$ at time $t+1$:

$$\begin{aligned} p(O|\lambda) &= \sum_{i_t} \sum_{i_{t+1}} p(O, i_t, i_{t+1}|\lambda) \\ &= \sum_{i_t} \sum_{i_{t+1}} p(o_{1:t}, o_{t+1}, o_{t+2:T}, i_t, i_{t+1}|\lambda) \\ &= \sum_{i_t} \sum_{i_{t+1}} p(o_{t+2:T}|o_{1:t}, o_{t+1}, i_t, i_{t+1}, \lambda)\, p(o_{t+1}|o_{1:t}, i_t, i_{t+1}, \lambda)\, p(i_{t+1}|i_t, o_{1:t}, \lambda)\, p(i_t, o_{1:t}|\lambda) \end{aligned}$$

By the independence assumptions of the hidden Markov model this simplifies to:

$$p(O|\lambda) = \sum_{i_t} \sum_{i_{t+1}} p(o_{t+2:T}|i_{t+1}, \lambda)\, p(o_{t+1}|i_{t+1}, \lambda)\, p(i_{t+1}|i_t, \lambda)\, p(i_t, o_{1:t}|\lambda)$$

Define:

$$\xi_t(i, j) = p(i_t = q_i, i_{t+1} = q_j|O, \lambda)$$

Then, using formulas (2), (4), (12) and (13):

$$\xi_t(i, j) = \frac{p(i_t = q_i, i_{t+1} = q_j, O|\lambda)}{p(O|\lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)} \tag{15}$$
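
A small sketch of how equations (14) and (15) turn the forward and backward quantities into $\gamma$ and $\xi$. Here alpha and beta are assumed to be the full (T, N) tables of $\alpha_t(i)$ and $\beta_t(i)$, not the scalar likelihoods returned by the sketches above:

def posteriors(O, A, B, alpha, beta):
    # gamma[t, i] per equation (14); xi[t, i, j] per equation (15).
    T, N = len(O), A.shape[0]
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi[t, i, j] is proportional to alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j).
        xi[t] = alpha[t][:, None] * A * (B[:, O[t+1]] * beta[t+1])[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi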

3. Learning Problem
3.1 Supervised Learning

If we have samples whose state sequences are already labeled, things are easy: define the dimensions of the matrices, fill them in by counting, and normalize. During counting, mind the constraint that each probability distribution must sum to one.
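
A minimal sketch of that counting procedure, assuming each tagged sentence is given as a list of (state index, observation index) pairs; the names tagged, N and M are illustrative:

def count_params(tagged, N, M):
    # Supervised estimates: A, B and pi are just normalized counts.
    A = np.zeros((N, N))
    B = np.zeros((N, M))
    pi = np.zeros(N)
    for sent in tagged:
        states = [s for s, _ in sent]
        pi[states[0]] += 1                      # count initial states
        for s, o in sent:
            B[s, o] += 1                        # count emissions
        for s1, s2 in zip(states, states[1:]):
            A[s1, s2] += 1                      # count transitions
    pi /= pi.sum()
    # Rows with no counts stay at zero here; the tables in section 1 used a uniform fallback.
    A /= np.maximum(A.sum(axis=1, keepdims=True), 1)
    B /= np.maximum(B.sum(axis=1, keepdims=True), 1)
    return A, B, pi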

3.2 Unsupervised Learning

Without labeled state sequences, the parameters can be learned with the Baum-Welch algorithm (an instance of the EM algorithm).

Given: $S$ observation sequences of length $T$, $\{O_1, O_2, ..., O_S\}$.
Goal: learn the parameters $\lambda = (A, B, \pi)$ of the hidden Markov model.

Write $O$ for the observed data and $I$ for the hidden data; the hidden Markov model can then be expressed as:

$$p(O|\lambda) = \sum_I p(O|I, \lambda)\, p(I|\lambda)$$

E-step:

Since $1/p(O|\overline{\lambda})$ is constant with respect to $\lambda$, it can be dropped from the $Q$ function:

$$\begin{aligned} Q(\lambda, \overline{\lambda}) &= E_I[\log p(O, I|\lambda)\,|\,O, \overline{\lambda}] \\ &= \sum_I \log p(O, I|\lambda)\, p(I|O, \overline{\lambda}) \\ &= \sum_I \log p(O, I|\lambda)\, \frac{p(I, O|\overline{\lambda})}{p(O|\overline{\lambda})} \\ &= \sum_I \log p(O, I|\lambda)\, p(I, O|\overline{\lambda}) \end{aligned}$$

Expanding the forward-algorithm recursion of section 2.1 gives:

$$p(O, I|\lambda) = \pi_{i_1} b_{i_1}(o_1)\, a_{i_1 i_2}\, b_{i_2}(o_2) \cdots a_{i_{T-1} i_T}\, b_{i_T}(o_T) = \pi_{i_1} \Big[\prod_{t=1}^{T-1} a_{i_t i_{t+1}}\Big] \Big[\prod_{t=1}^{T} b_{i_t}(o_t)\Big]$$

Taking the logarithm and splitting the three groups of terms:

$$Q(\lambda, \overline{\lambda}) = \sum_I \log \pi_{i_1}\, p(O, I|\overline{\lambda}) + \sum_I \Big(\sum_{t=1}^{T-1} \log a_{i_t i_{t+1}}\Big) p(O, I|\overline{\lambda}) + \sum_I \Big(\sum_{t=1}^{T} \log b_{i_t}(o_t)\Big) p(O, I|\overline{\lambda}) \tag{16}$$

A note on the hidden variables:
The hidden variable of a hidden Markov model is the state sequence behind the observation sequence, so it can be represented by the quantity in equation (14).
The M-step below also uses equation (15); this does not mean there are two distinct hidden variables. Representing the hidden variable by both $\gamma$ and $\xi$ in the E-step is purely for notational and programming convenience, and the information the two carry in the E-step is redundant.

M-step:

1. Solving for $\pi_i$.
From equation (16):

$$L(\pi_{i_1}) = \sum_I \log \pi_{i_1}\, p(O, I|\overline{\lambda}) = \sum_{i=1}^N \log \pi_i\, p(O, i_1 = i|\overline{\lambda})$$

Since $\pi_i$ satisfies the constraint $\sum_{i=1}^N \pi_i = 1$, introduce a Lagrange multiplier $\eta$ (a fresh symbol, so it does not collide with the model parameter $\lambda$ or with $\gamma_t(i)$) and write the Lagrangian:

$$\sum_{i=1}^N \log \pi_i\, p(O, i_1 = i|\overline{\lambda}) + \eta \Big(\sum_{i=1}^N \pi_i - 1\Big)$$

Taking the partial derivative with respect to $\pi_i$ and setting it to zero:

$$\frac{\partial}{\partial \pi_i} \Big[\sum_{i=1}^N \log \pi_i\, p(O, i_1 = i|\overline{\lambda}) + \eta \Big(\sum_{i=1}^N \pi_i - 1\Big)\Big] = 0 \tag{17}$$

gives:

$$p(O, i_1 = i|\overline{\lambda}) + \eta\, \pi_i = 0$$

so:

$$\pi_i = \frac{p(O, i_1 = i|\overline{\lambda})}{-\eta}$$

Substituting this into $\sum_{i=1}^N \pi_i = 1$:

$$-\eta = \sum_{i=1}^N p(O, i_1 = i|\overline{\lambda}) = p(O|\overline{\lambda})$$

and therefore, by equation (14):

$$\pi_i = \frac{p(O, i_1 = i|\overline{\lambda})}{p(O|\overline{\lambda})} = \gamma_1(i) \tag{18}$$

2. Solving for $a_{ij}$:

$$L(a_{ij}) = \sum_I \Big(\sum_{t=1}^{T-1} \log a_{i_t i_{t+1}}\Big) p(O, I|\overline{\lambda}) = \sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t = i, i_{t+1} = j|\overline{\lambda})$$

With the constraint $\sum_{j=1}^N a_{ij} = 1$, the Lagrangian is:

$$\sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t = i, i_{t+1} = j|\overline{\lambda}) + \eta \Big(\sum_{j=1}^N a_{ij} - 1\Big)$$

Taking the partial derivative with respect to $a_{ij}$ and setting it to zero:

$$\frac{\partial}{\partial a_{ij}} \Big[\sum_{i=1}^N \sum_{j=1}^N \sum_{t=1}^{T-1} \log a_{ij}\, p(O, i_t = i, i_{t+1} = j|\overline{\lambda}) + \eta \Big(\sum_{j=1}^N a_{ij} - 1\Big)\Big] = 0$$

gives:

$$\sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline{\lambda}) + \eta\, a_{ij} = 0$$

so:

$$a_{ij} = \frac{\sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline{\lambda})}{-\eta}$$

Substituting this into $\sum_{j=1}^N a_{ij} = 1$:

$$-\eta = \sum_{j=1}^N \sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline{\lambda}) = \sum_{t=1}^{T-1} p(O, i_t = i|\overline{\lambda})$$

Hence:

$$a_{ij} = \frac{\sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline{\lambda})}{\sum_{t=1}^{T-1} p(O, i_t = i|\overline{\lambda})} = \frac{\sum_{t=1}^{T-1} p(O, i_t = i, i_{t+1} = j|\overline{\lambda})\,/\,p(O|\overline{\lambda})}{\sum_{t=1}^{T-1} p(O, i_t = i|\overline{\lambda})\,/\,p(O|\overline{\lambda})}$$

Substituting (14) and (15):

$$a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \tag{19}$$

3. Solving for $b_j(k)$:

$$L(b_j(k)) = \sum_I \Big(\sum_{t=1}^{T} \log b_{i_t}(o_t)\Big) p(O, I|\overline{\lambda}) = \sum_{j=1}^N \sum_{t=1}^{T} \log b_j(o_t)\, p(O, i_t = j|\overline{\lambda})$$

With the constraint $\sum_{k=1}^M b_j(k) = 1$, the Lagrangian is:

$$\sum_{j=1}^N \sum_{t=1}^{T} \log b_j(o_t)\, p(O, i_t = j|\overline{\lambda}) + \eta \Big(\sum_{k=1}^M b_j(k) - 1\Big)$$

Taking the partial derivative with respect to $b_j(k)$ and setting it to zero:

$$\frac{\partial}{\partial b_j(k)} \Big[\sum_{j=1}^N \sum_{t=1}^{T} \log b_j(o_t)\, p(O, i_t = j|\overline{\lambda}) + \eta \Big(\sum_{k=1}^M b_j(k) - 1\Big)\Big] = 0$$

The derivative with respect to $b_j(k)$ is nonzero only at the times $t$ where $o_t = v_k$; marking those times with the indicator $I(o_t = v_k)$, and noting that $b_j(o_t)\, I(o_t = v_k)$ can be written as $b_j(k)$, this gives:

$$\sum_{t=1}^{T} p(O, i_t = j|\overline{\lambda})\, I(o_t = v_k) + \eta\, b_j(k) = 0$$

so:

$$b_j(k) = \frac{\sum_{t=1}^{T} p(O, i_t = j|\overline{\lambda})\, I(o_t = v_k)}{-\eta}$$

Substituting this into $\sum_{k=1}^M b_j(k) = 1$:

$$-\eta = \sum_{k=1}^M \sum_{t=1}^{T} p(O, i_t = j|\overline{\lambda})\, I(o_t = v_k) = \sum_{t=1}^{T} p(O, i_t = j|\overline{\lambda})$$

which gives:

$$b_j(k) = \frac{\sum_{t=1}^{T} p(O, i_t = j|\overline{\lambda})\, I(o_t = v_k)}{\sum_{t=1}^{T} p(O, i_t = j|\overline{\lambda})}$$

and, by equation (14):

$$b_j(k) = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} \tag{20}$$

Summary of the EM algorithm:
E-step:

$$\gamma_t(i) = p(i_t = q_i|O, \lambda) = \frac{p(i_t = q_i, O|\lambda)}{p(O|\lambda)} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^N \alpha_t(j)\, \beta_t(j)}$$

$$\xi_t(i, j) = \frac{p(i_t = q_i, i_{t+1} = q_j, O|\lambda)}{p(O|\lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^N \sum_{j=1}^N \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}$$

M-step:

$$\pi_i = \frac{p(O, i_1 = i|\overline{\lambda})}{p(O|\overline{\lambda})} = \gamma_1(i)$$

$$a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$

$$b_j(k) = \frac{\sum_{t=1, o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$$
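
Wiring the summary together, one Baum-Welch update for a single sequence looks like this in plain probability space. A sketch only: forward_table and backward_table are assumed helpers that return the full (T, N) alpha and beta tables, posteriors is the sketch from section 2.3, and the class in section 5.2 performs the same updates in log space:

def baum_welch_step(O, A, B, pi):
    alpha = forward_table(O, A, B, pi)  # assumed helper: (T, N) table of alpha_t(i)
    beta = backward_table(O, A, B)      # assumed helper: (T, N) table of beta_t(i)
    gamma, xi = posteriors(O, A, B, alpha, beta)
    new_pi = gamma[0]                                         # equation (18)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # equation (19)
    O = np.asarray(O)
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        # equation (20): gamma mass at the times where o_t = v_k.
        new_B[:, k] = gamma[O == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi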

4. Prediction Problem (Decoding)

This is solved with the Viterbi algorithm:
Given: the model $\lambda = (A, B, \pi)$ and $O = (o_1, o_2, ..., o_T)$.
Find: the state sequence $I = (i_1, i_2, ..., i_T)$ that maximizes the conditional probability $p(I|O, \lambda)$.
Since $p(O|\lambda)$ is a fixed value, maximizing $p(I|O, \lambda)$ is equivalent to maximizing the joint probability:

$$\max_I p(I|O, \lambda) = \max_I \frac{p(I, O|\lambda)}{p(O|\lambda)} \;\Longleftrightarrow\; \max_I p(I, O|\lambda)$$

Define $\delta_t(i)$ as the maximum probability over all single paths $(i_1, i_2, ..., i_t)$ that end in state $i$ at time $t$:

$$\delta_t(i) = \max_{i_1, i_2, ..., i_{t-1}} p(i_t = i, i_{1:t-1}, o_{1:t}|\lambda)$$

Recursive derivation:

$$\begin{aligned} p(i_{t+1} = i, i_{1:t}, o_{1:t+1}|\lambda) &= p(o_{t+1}|i_{t+1} = i, i_{1:t}, o_{1:t}, \lambda)\, p(i_{t+1} = i|i_{1:t}, o_{1:t}, \lambda)\, p(i_{1:t}, o_{1:t}|\lambda) \\ &= p(o_{t+1}|i_{t+1} = i, \lambda)\, p(i_{t+1} = i|i_t, \lambda)\, p(i_{1:t}, o_{1:t}|\lambda) \end{aligned}$$

where the second line again uses the independence assumptions of the hidden Markov model. Hence:

$$\delta_{t+1}(i) = \max_{i_1, i_2, ..., i_t} p(i_{t+1} = i, i_{1:t}, o_{1:t+1}|\lambda) = \max_{1 \le j \le N} [\delta_t(j)\, a_{ji}]\, b_i(o_{t+1}) \tag{21}$$

Also define $\psi_t(i)$ as the state at time $t-1$ on the most probable of all single paths $(i_1, i_2, ..., i_{t-1}, i)$ ending in state $i$ at time $t$:

$$\psi_t(i) = \arg \max_{1 \le j \le N} [\delta_{t-1}(j)\, a_{ji}] \tag{22}$$
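
A compact sketch of recursions (21)-(22) plus the backtracking pass, again in plain probability space with the example arrays from section 1 (the log-space version appears as decode() in section 5.2):

def viterbi(O, A, B, pi):
    T, N = len(O), A.shape[0]
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]
    for t in range(1, T):
        trans = delta[t-1][:, None] * A            # trans[j, i] = delta_{t-1}(j) * a_ji
        psi[t] = trans.argmax(axis=0)              # equation (22)
        delta[t] = trans.max(axis=0) * B[:, O[t]]  # equation (21)
    # Backtrack from the most probable final state.
    path = [int(delta[T-1].argmax())]
    for t in range(T-1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# viterbi([0, 1, 2], A, B, pi) recovers [0, 1, 0], i.e. 名词 动词 名词.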

5. Implementing the Model in Python
5.1 Parameter Correspondence

The correspondence between the symbols in the formulas above and the names in the model below (the code keeps the same symbols):

:param N: $N$, the number of states
:param M: $M$, the number of observations
:param V: $V$, the observation set, shape $(M,)$
:param A: $A$, the state transition matrix, shape $(N, N)$
:param B: $B$, the observation probability (emission) matrix, shape $(N, M)$
:param pi: $\pi$, the initial state vector, shape $(N,)$
:param S: $S$, the number of input sentences
:param T: $T$, the length of each sentence
:param gamma: $\gamma$, hidden variable, state probability matrix, shape $(S, N, T)$
:param xi: $\xi$, hidden variable, state-pair probability matrix, shape $(S, N, N, T)$
:param alpha: $\alpha$, forward algorithm results, shape $(N, T)$
:param beta: $\beta$, backward algorithm results, shape $(N, T)$
:param delta: $\delta$, maximum probabilities stored by the Viterbi algorithm, shape $(N, T)$
:param psi: $\psi$, argmax indices stored by the Viterbi algorithm, shape $(N, T)$
:param I: $I$, the output state sequence, shape $(T,)$

A trick: to avoid numerical underflow when many probabilities are multiplied together, do the computation in log space and exponentiate at the end. This relies on the log-sum-exp identity (see http://bayesjumping.net/log-sum-exp-trick/):

$$\log \sum_i \exp(x_i) = b + \log \sum_i \exp(x_i - b)$$

import numpy as np

def logSumExp(ns):
    # Shift by the maximum so np.exp cannot overflow; the identity holds for any shift b.
    max_val = np.max(ns)
    ds = ns - max_val
    sumOfExp = np.exp(ds).sum()
    return max_val + np.log(sumOfExp)
5.2 Python Implementation of the HMM
import numpy as np


class MyHMM(object):
    def __init__(self, N=None, A=None, B=None, pi=None):
        """
        HMM模型:
        >隐马尔可夫的三个基本问题
        >1.概率计算问题。给定模型$\lambda=(A,B,\pi)$和观测序列$O=(o_1,o_2,...,o_T)$,计算在已知模型参数的情况下,观测序列的概率,
            即$p(O|\lambda)$。用前向算法或后向算法。
        >2.学习问题。已知观测序列$O=(o_1,o_2,...,o_T)$,估计模型参数$\lambda=(A,B,\pi)$,使$p(O|\lambda)$最大。用BW算法。
        >3.预测问题,也称解码问题。已知模型$\lambda=(A,B,\pi)$和$O=(o_1,o_2,...,o_T)$,求条件概率最大$p(I|O)$最大的状态序列
            $I=(i_1,i_2,...,i_T)$。用维特比算法解码。

        :param N: $N$ 表示状态数
        :param M: $M$ 表示观测数
        :param V: $V$ 表示观测集合,维度$(M,)$
        :param A: $A$ 对应于状态转移矩阵,维度$(N, N)$
        :param B: $B$对应于观测概率矩阵(发射矩阵),维度$(N, M)$
        :param pi: $pi$ 对应于初始状态向量,维度$(N,)$
        :param S: $S$ 表示输入句子数量
        :param T: $T$ 表示每个句子的个数
        :param gamma: $gamma$ 隐变量,表示状态的概率矩阵,维度$(S,N,T)$
        :param xi: $xi$ 隐变量,表示状态的概率矩阵,维度$(S,N,N,T)$
        :param alpha: $alpha$ 前向算法结果,维度$(N,T)$
        :param beta: $beta$ 后向算法结果,维度$(N,T)$
        :param delta: $delta$ 维特比算法中存储概率最大值,维度$(N,T)$
        :param psi: $psi$ 维特比算法中存储概率最大值索引,维度$(N,T)$
        :param I: $I$ 输出的状态向量,维度$(T,)$
        """
        self.N = N  # number of states
        self.params = {
            'A': A,
            'B': B,
            'pi': pi,
            'gamma': None,
            'xi': None
        }

        self.M = None  # number of observations

        self.S = None  # number of sentences
        self.T = None  # length of each sentence

        self.V = None  # observation set

        self.eps = np.finfo(float).eps

        np.random.seed(2)

    def _init_params(self):
        """
        初始化模型参数
        :return:
        """
        def generate_random_n_data(N):
            ret = np.random.rand(N)
            return ret / np.sum(ret)

        pi = generate_random_n_data(self.N)
        A = np.array([generate_random_n_data(self.N) for _ in range(self.N)])
        B = np.array([generate_random_n_data(self.M) for _ in range(self.N)])

        gamma = np.zeros((self.S, self.N, self.T))
        xi = np.zeros((self.S, self.N, self.N, self.T))

        self.params = {
            'A': A,
            'B': B,
            'pi': pi,
            'gamma': gamma,
            'xi': xi
        }

    def logSumExp(self, ns, axis=None):
        # Log-sum-exp trick; the identity is exact for any shift, the max is the stable choice.
        max_val = np.max(ns)
        ds = ns - max_val
        sumOfExp = np.exp(ds).sum(axis=axis)
        return max_val + np.log(sumOfExp)

    def _forward(self, O_s):
        """
        前向算法,公式参考博客公式(9)
        :param O_s: 单个序列,维度(N,)
        :return:
        """
        A = self.params['A']
        B = self.params['B']
        pi = self.params['pi']
        T = len(O_s)

        log_alpha = np.zeros((self.N, T))

        # Initialization: log alpha_1(i) = log pi_i + log b_i(o_1).
        for i in range(self.N):
            log_alpha[i, 0] = np.log(pi[i] + self.eps) + np.log(B[i, O_s[0]] + self.eps)

        # Recursion (9) in log space, using logSumExp over predecessor states.
        for t in range(1, T):
            for i in range(self.N):
                log_alpha[i, t] = self.logSumExp(np.array([log_alpha[_i, t-1] +
                                                           np.log(A[_i, i] + self.eps) +
                                                           np.log(B[i, O_s[t]] + self.eps)
                                                           for _i in range(self.N)]))
        return log_alpha

    def _backward(self, O_s):
        """
        后向算法,参考博客公式(11)
        :param O_s: 单个序列,维度(N,)
        :return:
        """
        A = self.params['A']
        B = self.params['B']
        pi = self.params['pi']
        T = len(O_s)

        log_beta = np.zeros((self.N, T))

        # Initialization: beta_T(i) = 1, so log beta_T(i) = 0.
        for i in range(self.N):
            log_beta[i, T-1] = 0

        for t in range(T-2, -1, -1):
            for i in range(self.N):
                log_beta[i, t] = self.logSumExp(np.array([
                    log_beta[_i, t+1] + np.log(A[i, _i] + self.eps) + np.log(B[_i, O_s[t+1]] + self.eps)
                for _i in range(self.N)]))
        return log_beta

    def _E_step(self, O):
        """
        BW算法的E_step
        计算隐变量,参考博客公式(9)(11)
        :param O:
        :return:
        """
        A = self.params['A']
        B = self.params['B']
        pi = self.params['pi']
        # process each of the S sentences in turn
        for s in range(self.S):
            O_s = O[s]
            log_alpha = self._forward(O_s)
            log_beta = self._backward(O_s)

            # Log-likelihood from the forward pass: log p(O|lambda).
            log_likelihood = self.logSumExp(log_alpha[:, self.T - 1])
            # # The backward pass yields the same value (useful as a sanity check):
            # log_likelihood = self.logSumExp(np.array([np.log(pi[_i] + self.eps) + np.log(B[_i, O_s[0]] + self.eps) + log_beta[_i, 0] for _i in range(self.N)]))

            for i in range(self.N):
                self.params['gamma'][s, i, self.T-1] = log_alpha[i, self.T-1] + log_beta[i, self.T-1] - log_likelihood

            for t in range(self.T - 1):
                for i in range(self.N):
                    self.params['gamma'][s, i, t] = log_alpha[i, t] + log_beta[i, t] - log_likelihood
                    for j in range(self.N):
                        self.params['xi'][s, i, j, t] = log_alpha[i, t] + np.log(A[i, j] + self.eps) + np.log(B[j, O_s[t + 1]] + self.eps) + log_beta[j, t+1] - log_likelihood

    def _M_step(self, O):
        """
        BW算法的M_step。参考博客公式(18)(19)(20)
        :param O:
        :return:
        """
        gamma = self.params['gamma']
        xi = self.params['xi']

        count_gamma = np.zeros((self.S, self.N, self.M))
        count_xi = np.zeros((self.S, self.N, self.N))

        for s in range(self.S):
            O_s = O[s, :]
            for i in range(self.N):
                for k in range(self.M):
                    if not (O_s == k).any():
                        # symbol k never occurs in this sentence: log of a near-zero count
                        count_gamma[s, i, k] = np.log(self.eps)
                    else:
                        count_gamma[s, i, k] = self.logSumExp(gamma[s, i, O_s == k])

                for j in range(self.N):
                    # xi is only defined for t = 1..T-1; drop the unused last slot, which
                    # would otherwise add a phantom log(1) = 0 term to every sum.
                    count_xi[s, i, j] = self.logSumExp(xi[s, i, j, :-1])

        self.params['pi'] = np.exp(self.logSumExp(gamma[:, :, 0], axis=0) - np.log(self.S + self.eps))
        np.testing.assert_almost_equal(self.params['pi'].sum(), 1)

        for i in range(self.N):
            for k in range(self.M):
                self.params['B'][i, k] = np.exp(self.logSumExp(count_gamma[:, i, k]) - self.logSumExp(
                    count_gamma[:, i, :]
                ))

            for j in range(self.N):
                self.params['A'][i, j] = np.exp(self.logSumExp(count_xi[:, i, j]) - self.logSumExp(
                    count_xi[:, i, :]
                ))

            np.testing.assert_almost_equal(self.params['A'][i, :].sum(), 1)
            np.testing.assert_almost_equal(self.params['B'][i, :].sum(), 1)

    def fit(self, O, V=(0,1,2,3,4), max_iter=20):
        O = np.array(O)
        self.S, self.T = O.shape
        self.M = len(V)
        self.V = V
        print(self.S, self.T)

        self._init_params()

        for i in range(max_iter):
            self._E_step(O)
            self._M_step(O)

    def decode(self, O_s):
        """
        Decode with the Viterbi algorithm; see equations (21) and (22).
        :param O_s: a single observation sequence, shape (T,)
        :return: the most probable state sequence I, as a list of length T
        """
        O_s = np.array(O_s)
        if len(O_s.shape) != 1:
            raise ValueError('decode() only accepts a single sequence.')

        T = len(O_s)

        # Size the tables by the sequence being decoded, not the training length self.T.
        delta = np.zeros((self.N, T))
        psi = np.zeros((self.N, T))

        for i in range(self.N):
            psi[i, 0] = 0
            delta[i, 0] = np.log(self.params['pi'][i] + self.eps) + np.log(self.params['B'][i, O_s[0]] + self.eps)

        for t in range(1, T):
            for i in range(self.N):
                seq_prob = [delta[j, t-1] + np.log(self.params['A'][j, i] + self.eps) + np.log(self.params['B'][i, O_s[t]] + self.eps) for j in range(self.N)]
                delta[i, t] = np.max(seq_prob)
                psi[i, t] = np.argmax(seq_prob)

        pointer = np.argmax(delta[:, -1])
        I = [pointer]
        for t in reversed(range(1, T)):
            pointer = int(psi[int(pointer), t])
            I.append(pointer)

        I.reverse()
        return I
5.3 Testing the Model
def generate_data():
    O = [['我', '看见', '猫'],
         ['猫', '是', '可爱的'],
         ['我', '是', '可爱的']]
    word2index = {}
    index2word = {}
    for sentence in O:
        for word in sentence:
            if word not in word2index.keys():
                word2index[word] = len(word2index)
                index2word[len(index2word)] = word
    print(word2index)
    print(index2word)
    O_input = []
    for sentence in O:
        O_input.append([word2index[word] for word in sentence])
    print(O_input)
    return O_input


def run_my_model():
    O_input = generate_data()
    N = 3  # three hidden states, one per part of speech
    my = MyHMM(N=N)
    my.fit(O_input)

    print('A:')
    print(my.params['A'])
    print('B:')
    print(my.params['B'])
    print('pi:')
    print(my.params['pi'])

    I = my.decode(O_s=(2, 1, 0))
    print("I:")
    print(I)

The printed output is:
A:
[[0.33333528 0.33332093 0.33334378]
[0.43652988 0.25000742 0.3134627 ]
[0.2500279 0.4999721 0.25 ]]
B:
[[9.91856630e-17 1.12533719e-04 1.06467892e-01 3.70235380e-05 8.93382551e-01]
[7.40229993e-17 3.33285967e-01 1.91726675e-07 6.66712274e-01 1.56719927e-06]
[5.31681136e-01 1.48466699e-11 4.68318638e-01 1.95615241e-11 2.25912305e-07]]
pi:
[1.16786216e-17 3.50652002e-23 1.00000000e+00]
I:
[2, 1, 2]

References:
《统计学习方法》 (Statistical Learning Methods), Hang Li
