Abstract:
This article walks through a Python implementation of the hidden Markov model (HMM). The main references are:
[1]. Lawrence R. Rabiner, 'A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition', Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286, February 1989
[2]. Dawei Shen, 'Some Mathematics for HMM', October 13th, 2008
[3]. Li Hang (李航), Statistical Learning Methods (统计学习方法)
[4]. Wu Jun (吴军), The Beauty of Mathematics (数学之美)
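Reference [1] is a classic introduction to hidden Markov models: clearly organized, thorough, and reasonably deep. References [2] and [3] both draw heavily on it.
The main strength of reference [2] is that it derives simplified forms of several formulas, which makes the code easier to write and gives a clearer view of the concepts from another angle.
Chapter 10 of Statistical Learning Methods is devoted to hidden Markov models. It derives the Baum-Welch parameter estimation formulas by writing down the Lagrangian with Lagrange multipliers and taking partial derivatives. Interestingly, reference [1] writes the same estimation formulas down directly from probability theory, and the two results are of course identical. The book also offers a "box and ball model" example to illustrate the HMM concepts, which makes it easy to see quickly what a hidden Markov model actually is. Readers who know the book will recall that it packs a lot into just over two hundred pages: every chapter is dense and concise. In places it is, admittedly, a little too concise.
Chapter 26 of The Beauty of Mathematics explains the principle of the Viterbi algorithm fairly clearly; if the algorithm is unclear to you, that chapter is worth reading. The book is suitable for beginners and outsiders who want to understand machine learning algorithms and the ideas behind them, and it is also helpful for practitioners who already have some background.
This article does not spend much time on concepts. Instead it points to the relevant part of the relevant reference, and it quotes formulas (or equations) from the references directly. Readers who want to follow along should therefore have the references at hand (the first two can be found by searching online; both are downloadable PDF documents).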
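Outline:
Notation
Python implementation of the Viterbi algorithm
Forward algorithm, backward algorithm, gamma (γ), xi (ξ), the Baum-Welch algorithm and its Python implementation
Baum-Welch with scaling, and Baum-Welch with scaling for multiple observation sequences
Extensions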
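1. Notation
S = {s1, s2, …, sN} is the set of all possible states (Statistical Learning Methods uses Q for this),
N is the number of possible states,
V = {v1, v2, …, vM} is the set of possible observations,
M is the number of possible observations,
I is a state sequence of length T, and O is the corresponding observation sequence,
A is the state transition probability matrix,
B is the observation probability matrix,
Pi is the initial state probability vector,
Lambda = (A, B, Pi) denotes the parameters of the hidden Markov model.
Readers who are new to hidden Markov models, or not yet comfortable with them, can start with the "box and ball model" example in Statistical Learning Methods, which makes the basic concepts easy to grasp.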
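To make the notation concrete, here is a minimal sketch of a model Lambda = (A, B, Pi) as NumPy arrays. The numbers are a hypothetical toy model with N = 3 states and M = 2 observation symbols, and `is_valid_model` is an illustrative helper, not something taken from the references.

```python
import numpy as np

# Hypothetical toy model: N = 3 hidden states, M = 2 observation symbols.
A = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])   # A[i, j] = P(next state is j | current state is i)
B = np.array([[0.5, 0.5],
              [0.4, 0.6],
              [0.7, 0.3]])        # B[i, k] = P(observe v_k | state is i)
Pi = np.array([0.2, 0.4, 0.4])    # Pi[i] = P(first state is i)
O = np.array([0, 1, 0])           # observations stored as indices into V

N, M = B.shape


def is_valid_model(A, B, Pi):
    """Return True if A, B and Pi have matching shapes and are normalized."""
    return bool(
        A.shape == (len(Pi), len(Pi))
        and B.shape[0] == len(Pi)
        and (A >= 0).all() and (B >= 0).all() and (Pi >= 0).all()
        and np.allclose(A.sum(axis=1), 1.0)   # each row of A is a distribution
        and np.allclose(B.sum(axis=1), 1.0)   # each row of B is a distribution
        and np.isclose(Pi.sum(), 1.0)
    )


print(N, M, is_valid_model(A, B, Pi))
```

Storing observations as integer indices matters later: the code indexes B with them, as in B[i, O[t]].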
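2. Python implementation of the Viterbi algorithm
A hidden Markov model (HMM from here on) poses three basic problems: probability computation, learning, and prediction (also called decoding). The first is solved by the forward and backward algorithms; the second and third are usually solved by the Baum-Welch algorithm and the Viterbi algorithm, respectively.
Most of the time the forward and backward algorithms exist to serve Baum-Welch, whereas the Viterbi algorithm stands on its own, so we start with Viterbi.
The Viterbi algorithm uses dynamic programming to reduce the cost of the search. The reduction is relative to brute force, which lists every possible state sequence I and picks the one with the highest probability.
If the Viterbi algorithm is unclear, see Chapter 26 of The Beauty of Mathematics.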
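For comparison, the brute-force approach mentioned above can be sketched in a few lines. This is only an illustration of what Viterbi avoids: it enumerates all N^T state sequences, so it is usable only for tiny problems. The model values and the function name `brute_force_decode` are assumptions made for this example.

```python
import itertools

import numpy as np

# Hypothetical toy model (3 states, 2 symbols) and a length-3 observation sequence.
A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])
Pi = np.array([0.2, 0.4, 0.4])
O = [0, 1, 0]


def brute_force_decode(A, B, Pi, O):
    """Score every possible state sequence and return the most probable one."""
    N, T = A.shape[0], len(O)
    best_path, best_prob = None, -1.0
    for path in itertools.product(range(N), repeat=T):   # N**T candidates
        prob = Pi[path[0]] * B[path[0], O[0]]
        for t in range(1, T):
            prob *= A[path[t - 1], path[t]] * B[path[t], O[t]]
        if prob > best_prob:
            best_path, best_prob = path, prob
    return best_path, best_prob


best_path, best_prob = brute_force_decode(A, B, Pi, O)
print(best_path, best_prob)
```

The loop visits N^T paths, while the dynamic-programming recursion needs on the order of T * N^2 multiplications.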
The implementation is straightforward: just follow the steps of Algorithm 10.5 on page 185 of Statistical Learning Methods.
First define a class:
import numpy as np

class HMM:
    def __init__(self, Ann, Bnm, Pi, O):
        self.A = np.array(Ann, float)   # N x N state transition matrix
        self.B = np.array(Bnm, float)   # N x M observation matrix
        self.Pi = np.array(Pi, float)   # initial state distribution
        self.O = np.array(O, int)       # observation indices; must be int to index B
        self.N = self.A.shape[0]
        self.M = self.B.shape[1]
The Viterbi algorithm is a method of class HMM. The code is as follows:
    def viterbi(self):
        # given O and lambda, find the most likely state sequence I
        T = len(self.O)
        I = np.zeros(T, int)

        delta = np.zeros((T, self.N), float)
        psi = np.zeros((T, self.N), int)   # back-pointers are state indices

        for i in range(self.N):
            delta[0, i] = self.Pi[i] * self.B[i, self.O[0]]
            psi[0, i] = 0

        for t in range(1, T):
            for i in range(self.N):
                scores = np.array([delta[t-1, j] * self.A[j, i]
                                   for j in range(self.N)])
                delta[t, i] = self.B[i, self.O[t]] * scores.max()
                psi[t, i] = scores.argmax()

        P_T = delta[T-1, :].max()   # probability of the best path (not returned)
        I[T-1] = delta[T-1, :].argmax()

        for t in range(T-2, -1, -1):
            I[t] = psi[t+1, I[t+1]]

        return I
delta and psi stand for δ and ψ.
np comes from import numpy as np. NumPy is very convenient here; its argmax() method is exactly what the back-pointer step needs.
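As a usage-style illustration, the same recursion can be written as a standalone function with the inner loops replaced by NumPy broadcasting. This is a sketch under the assumption that observations are integer indices; `viterbi_vectorized` and the toy numbers are illustrative names and values, not part of the class above.

```python
import numpy as np


def viterbi_vectorized(A, B, Pi, O):
    """Return (best state path, its probability) for observation indices O."""
    A, B, Pi = np.asarray(A, float), np.asarray(B, float), np.asarray(Pi, float)
    O = np.asarray(O, int)
    T, N = len(O), A.shape[0]
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)

    delta[0] = Pi * B[:, O[0]]
    for t in range(1, T):
        # scores[j, i] = delta[t-1, j] * A[j, i]
        scores = delta[t - 1][:, None] * A
        psi[t] = scores.argmax(axis=0)            # best predecessor j for each i
        delta[t] = scores.max(axis=0) * B[:, O[t]]

    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                # follow the back-pointers
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta[-1].max()


# Hypothetical toy model: 3 states, 2 observation symbols.
A = [[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]]
B = [[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]]
Pi = [0.2, 0.4, 0.4]
path, prob = viterbi_vectorized(A, B, Pi, [0, 1, 0])
print(path, prob)
```

For long sequences the products in delta shrink toward zero; working with log probabilities is a common remedy, though it is not what the code in this article does.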
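3. Forward algorithm, backward algorithm, gamma (γ), xi (ξ), the Baum-Welch algorithm and its Python implementation
The HMM derivations involve many probability terms. If you are not comfortable with probability, the derivations will be hard to follow; Bishop's Pattern Recognition and Machine Learning is a good book to read, and a long-standing classic in machine learning.
The forward probability matrix is alpha (α; in written formulas it looks a lot like the letter a, so take care to tell them apart),
and the backward probability matrix is beta (β).
For the definitions and steps of the algorithms, see page 175 of Statistical Learning Methods, or reference [2].
The code is as follows: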
    def forward(self):
        T = len(self.O)
        alpha = np.zeros((T, self.N), float)

        for i in range(self.N):
            alpha[0, i] = self.Pi[i] * self.B[i, self.O[0]]

        for t in range(T-1):
            for i in range(self.N):
                summation = 0.0   # for every i 'summation' should reset to 0
                for j in range(self.N):
                    summation += alpha[t, j] * self.A[j, i]
                alpha[t+1, i] = summation * self.B[i, self.O[t+1]]

        summation = 0.0
        for i in range(self.N):
            summation += alpha[T-1, i]
        Polambda = summation
        return Polambda, alpha
    def backward(self):
        T = len(self.O)
        beta = np.zeros((T, self.N), float)

        for i in range(self.N):
            beta[T-1, i] = 1.0

        for t in range(T-2, -1, -1):
            for i in range(self.N):
                summation = 0.0   # for every i 'summation' should reset to 0
                for j in range(self.N):
                    summation += self.A[i, j] * self.B[j, self.O[t+1]] * beta[t+1, j]
                beta[t, i] = summation

        Polambda = 0.0
        for i in range(self.N):
            Polambda += self.Pi[i] * self.B[i, self.O[0]] * beta[0, i]
        return Polambda, beta
Polambda stands for P(O | λ).
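A useful sanity check is that the forward and backward recursions give the same P(O | λ). The sketch below restates both recursions as standalone functions using matrix products; the toy model and the function names are assumptions made for this example.

```python
import numpy as np


def forward(A, B, Pi, O):
    """Forward pass: alpha[t, i] = P(o_1..o_t, state at t is i | lambda)."""
    T, N = len(O), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = Pi * B[:, O[0]]
    for t in range(T - 1):
        # (alpha[t] @ A)[i] = sum_j alpha[t, j] * A[j, i]
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    return alpha[-1].sum(), alpha


def backward(A, B, Pi, O):
    """Backward pass: beta[t, i] = P(o_{t+1}..o_T | state at t is i, lambda)."""
    T, N = len(O), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        # (A @ v)[i] = sum_j A[i, j] * v[j]
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return (Pi * B[:, O[0]] * beta[0]).sum(), beta


# Hypothetical toy model: 3 states, 2 observation symbols.
A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])
Pi = np.array([0.2, 0.4, 0.4])
O = np.array([0, 1, 0])

p_fwd, alpha = forward(A, B, Pi, O)
p_bwd, beta = backward(A, B, Pi, O)
print(p_fwd, p_bwd)
```

If the two values disagree beyond rounding error, an index is probably transposed somewhere (A[j, i] versus A[i, j] is the usual suspect).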
Next we compute gamma (γ) and xi (ξ). The formulas in Statistical Learning Methods lead to the following code:
    def compute_gamma(self, alpha, beta):
        T = len(self.O)
        # gamma[t, i]: probability of being in state i at time t, given O and lambda
        gamma = np.zeros((T, self.N), float)
        for t in range(T):
            for i in range(self.N):
                gamma[t, i] = alpha[t, i] * beta[t, i] / sum(
                    alpha[t, j] * beta[t, j] for j in range(self.N))
        return gamma
    def compute_xi(self, alpha, beta):
        T = len(self.O)
        xi = np.zeros((T-1, self.N, self.N), float)   # note: T-1, not T
        for t in range(T-1):                           # note: T-1, not T
            for i in range(self.N):
                for j in range(self.N):
                    numerator = (alpha[t, i] * self.A[i, j]
                                 * self.B[j, self.O[t+1]] * beta[t+1, j])
                    # the product below must not be replaced by 'numerator',
                    # since the 'i, j' in 'numerator' are fixed.
                    # Use 'i1, j1' below instead of 'i, j' to avoid errors and confusion.
                    denominator = sum(sum(
                        alpha[t, i1] * self.A[i1, j1] * self.B[j1, self.O[t+1]] * beta[t+1, j1]
                        for j1 in range(self.N))    # the second sum
                        for i1 in range(self.N))    # the first sum
                    xi[t, i, j] = numerator / denominator
        return xi
Note that both methods take alpha and beta as arguments.
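The two quantities are linked: summing xi[t, i, j] over j gives gamma[t, i], and each gamma[t] sums to 1. The sketch below computes both with broadcasting and checks these identities on a toy model; the function names and numbers are assumptions made for the example, not the class methods above.

```python
import numpy as np


def forward_backward(A, B, Pi, O):
    """Return unscaled alpha and beta, each of shape (T, N)."""
    T, N = len(O), A.shape[0]
    alpha, beta = np.zeros((T, N)), np.ones((T, N))
    alpha[0] = Pi * B[:, O[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return alpha, beta


def compute_gamma(alpha, beta):
    """gamma[t, i] = P(state at t is i | O, lambda)."""
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)


def compute_xi(A, B, O, alpha, beta):
    """xi[t, i, j] = P(state at t is i, state at t+1 is j | O, lambda)."""
    emit_beta = B[:, O[1:]].T * beta[1:]          # [t, j] = B[j, O[t+1]] * beta[t+1, j]
    xi = alpha[:-1, :, None] * A[None, :, :] * emit_beta[:, None, :]
    return xi / xi.sum(axis=(1, 2), keepdims=True)


# Hypothetical toy model: 3 states, 2 observation symbols.
A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])
Pi = np.array([0.2, 0.4, 0.4])
O = np.array([0, 1, 0, 0, 1])

alpha, beta = forward_backward(A, B, Pi, O)
gamma = compute_gamma(alpha, beta)
xi = compute_xi(A, B, O, alpha, beta)
print(np.allclose(gamma.sum(axis=1), 1.0), np.allclose(xi.sum(axis=2), gamma[:-1]))
```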
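Now we implement the Baum_Welch algorithm, following Statistical Learning Methods or reference [2].
The first step is to initialize the parameters, and how they are initialized matters a great deal. Baum_Welch (a special case of the EM algorithm) is not guaranteed to find the global optimum; it can easily fall into a local optimum and stay there.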
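One common way to reduce the dependence on a single starting point is to draw several random initializations and keep the run with the best likelihood. The sketch below only shows the drawing step, using Dirichlet samples so that every row is a valid distribution; `random_init` is a hypothetical helper, and using a flat Dirichlet is an assumption rather than a recommendation from the references.

```python
import numpy as np


def random_init(N, M, seed=None):
    """Draw a random (A, B, Pi) whose rows are valid probability distributions."""
    rng = np.random.default_rng(seed)
    A = rng.dirichlet(np.ones(N), size=N)    # N rows, each summing to 1
    B = rng.dirichlet(np.ones(M), size=N)    # N rows over M symbols
    Pi = rng.dirichlet(np.ones(N))           # a single distribution over states
    return A, B, Pi


A0, B0, Pi0 = random_init(N=4, M=2, seed=0)
print(A0.shape, B0.shape, Pi0.shape)
```

Different seeds give different starting points; training from each and comparing the resulting P(O | λ) is one way to notice a poor local optimum.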
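The loop keeps running while delta_lambda is greater than a threshold x.
The choice of x matters. If it is too small, the program can easily loop forever: lambda may still change by a fair amount at each iteration, so near a local or global optimum it can keep oscillating around it with delta_lambda > x at every step.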
    def Baum_Welch(self):
        # given O, find the model lambda (T can be derived from O);
        # N and M are also given
        T = len(self.O)
        V = [k for k in range(self.M)]

        # initialization - lambda (hard-coded here for N = 4, M = 2)
        self.A = np.array([[0,1,0,0],[0.4,0,0.6,0],[0,0.4,0,0.6],[0,0,0.5,0.5]], float)
        self.B = np.array([[0.5,0.5],[0.3,0.7],[0.6,0.4],[0.8,0.2]], float)
        # uniform (mean) values may not be a good choice
        self.Pi = np.array([1.0 / self.N] * self.N, float)
        # self.A = np.array([[1.0 / self.N] * self.N] * self.N)
        # self.B = np.array([[1.0 / self.M] * self.M] * self.N)

        x = 1
        delta_lambda = x + 1
        times = 0
        # iteration - lambda
        while delta_lambda > x:
            Polambda1, alpha = self.forward()            # get alpha
            Polambda2, beta = self.backward()            # get beta
            gamma = self.compute_gamma(alpha, beta)      # use alpha, beta
            xi = self.compute_xi(alpha, beta)

            # keep copies: the updates below modify the arrays in place,
            # so storing references would always give delta_lambda == 0
            lambda_n = [self.A.copy(), self.B.copy(), self.Pi.copy()]

            for i in range(self.N):
                for j in range(self.N):
                    numerator = sum(xi[t, i, j] for t in range(T-1))
                    denominator = sum(gamma[t, i] for t in range(T-1))
                    self.A[i, j] = numerator / denominator

            for j in range(self.N):
                for k in range(self.M):
                    numerator = sum(gamma[t, j] for t in range(T) if self.O[t] == V[k])
                    denominator = sum(gamma[t, j] for t in range(T))
                    self.B[j, k] = numerator / denominator   # row j, not i

            for i in range(self.N):
                self.Pi[i] = gamma[0, i]

            # sum absolute differences; summing signed differences directly
            # would let positive and negative changes cancel
            delta_A = np.abs(lambda_n[0] - self.A)
            delta_B = np.abs(lambda_n[1] - self.B)
            delta_Pi = np.abs(lambda_n[2] - self.Pi)
            delta_lambda = delta_A.sum() + delta_B.sum() + delta_Pi.sum()
            times += 1
            print(times)

        return self.A, self.B, self.Pi
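Because a threshold that is too small can keep the loop running indefinitely, it is common to combine the tolerance with a cap on the number of iterations. The helper below is a generic sketch of that stopping rule; `iterate_until_converged`, `tol`, `max_iter` and the toy `halve_distance` update are hypothetical names standing in for one Baum-Welch re-estimation step.

```python
import numpy as np


def iterate_until_converged(update, params, tol=1e-4, max_iter=200):
    """Apply `update` until the summed absolute change is <= tol or max_iter is hit.

    `update` maps a tuple of arrays to a new tuple of arrays (for example one
    re-estimation step returning (A, B, Pi)). Returns (params, iterations, converged).
    """
    for n_iter in range(1, max_iter + 1):
        new_params = update(params)
        change = sum(np.abs(new - old).sum() for new, old in zip(new_params, params))
        params = new_params
        if change <= tol:
            return params, n_iter, True
    return params, max_iter, False


def halve_distance(params):
    """Toy update: move halfway toward 1.0 (stands in for an EM step)."""
    return (0.5 * (params[0] + 1.0),)


params, n_iter, converged = iterate_until_converged(halve_distance, (np.array([0.0]),))
print(params[0], n_iter, converged)
```

The returned flag makes it explicit whether the loop stopped because the change was small or because it ran out of iterations.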
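4. Baum-Welch with scaling, and Baum-Welch with scaling for multiple observation sequences
In theory the code above is a complete HMM implementation. In practice things are never that simple, and a number of problems remain. This article covers only two of them: using scaling to deal with the floating-point underflow that easily occurs during the computation, and an improved Baum-Welch algorithm that takes several observation sequences at once.
See reference [2]: the scaling problem.
Following the formulas in reference [2], we add scaling and rewrite the three methods forward(), backward() and Baum-Welch().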
    def forward_with_scale(self):
        T = len(self.O)
        alpha_raw = np.zeros((T, self.N), float)
        alpha = np.zeros((T, self.N), float)
        c = [0.0] * T   # scaling factors; initial values are overwritten below

        for i in range(self.N):
            alpha_raw[0, i] = self.Pi[i] * self.B[i, self.O[0]]

        c[0] = 1.0 / sum(alpha_raw[0, i] for i in range(self.N))
        for i in range(self.N):
            alpha[0, i] = c[0] * alpha_raw[0, i]

        for t in range(T-1):
            for i in range(self.N):
                summation = 0.0
                for j in range(self.N):
                    summation += alpha[t, j] * self.A[j, i]
                alpha_raw[t+1, i] = summation * self.B[i, self.O[t+1]]

            c[t+1] = 1.0 / sum(alpha_raw[t+1, i1] for i1 in range(self.N))
            for i in range(self.N):
                alpha[t+1, i] = c[t+1] * alpha_raw[t+1, i]

        return alpha, c

    def backward_with_scale(self, c):
        T = len(self.O)
        beta_raw = np.zeros((T, self.N), float)
        beta = np.zeros((T, self.N), float)

        for i in range(self.N):
            beta_raw[T-1, i] = 1.0
            beta[T-1, i] = c[T-1] * beta_raw[T-1, i]

        for t in range(T-2, -1, -1):
            for i in range(self.N):
                summation = 0.0
                for j in range(self.N):
                    summation += self.A[i, j] * self.B[j, self.O[t+1]] * beta[t+1, j]
                beta[t, i] = c[t] * summation   # summation = beta_raw[t, i]

        return beta
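To see why scaling is needed, the sketch below runs an unscaled forward pass on a long sequence, where P(O | λ) underflows to exactly 0.0, and compares it with a scaled pass that recovers log P(O | λ) as the negative sum of log c[t]. The toy model, the random observation sequence and the function names are assumptions made for this illustration.

```python
import numpy as np

# Hypothetical toy model: 3 states, 2 observation symbols.
A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])
Pi = np.array([0.2, 0.4, 0.4])
O_long = np.random.default_rng(0).integers(0, 2, size=5000)


def forward_unscaled(A, B, Pi, O):
    """Plain forward pass; returns P(O | lambda), which underflows for long O."""
    alpha = Pi * B[:, O[0]]
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()


def forward_log_likelihood(A, B, Pi, O):
    """Scaled forward pass; returns log P(O | lambda) = -sum_t log(c_t)."""
    alpha = Pi * B[:, O[0]]
    log_c_sum = 0.0
    for t in range(len(O)):
        if t > 0:
            alpha = (alpha @ A) * B[:, O[t]]
        c = 1.0 / alpha.sum()       # scaling factor for this time step
        alpha = alpha * c           # rescaled alpha sums to 1
        log_c_sum += np.log(c)
    return -log_c_sum


p_long = forward_unscaled(A, B, Pi, O_long)
loglik_long = forward_log_likelihood(A, B, Pi, O_long)
print(p_long, loglik_long)
```

Every step multiplies the total by a factor of at most 0.7 in this model, so after 5000 steps the unscaled value is far below the smallest positive double, while the log-likelihood stays an ordinary finite number.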
    def Baum_Welch_with_scale(self):
        T = len(self.O)
        V = [k for k in range(self.M)]

        # initialization - lambda, values should be float (hard-coded for N = 4, M = 2);
        # self.Pi keeps the value passed to the constructor
        self.A = np.array([[0.2,0.2,0.3,0.3],[0.2,0.1,0.6,0.1],[0.3,0.4,0.1,0.2],[0.3,0.2,0.2,0.3]])
        self.B = np.array([[0.5,0.5],[0.3,0.7],[0.6,0.4],[0.8,0.2]])

        x = 5   # a loose threshold; a smaller value gives tighter convergence
        delta_lambda = x + 1
        times = 0
        # iteration - lambda
        while delta_lambda > x:
            alpha, c = self.forward_with_scale()
            beta = self.backward_with_scale(c)

            # keep copies: the updates below modify the arrays in place
            lambda_n = [self.A.copy(), self.B.copy(), self.Pi.copy()]

            for i in range(self.N):
                for j in range(self.N):
                    numerator_A = sum(alpha[t, i] * self.A[i, j] * self.B[j, self.O[t+1]]
                                      * beta[t+1, j] for t in range(T-1))
                    denominator_A = sum(alpha[t, i] * beta[t, i] / c[t] for t in range(T-1))
                    self.A[i, j] = numerator_A / denominator_A

            for j in range(self.N):
                for k in range(self.M):
                    numerator_B = sum(alpha[t, j] * beta[t, j] / c[t]
                                      for t in range(T) if self.O[t] == V[k])
                    denominator_B = sum(alpha[t, j] * beta[t, j] / c[t] for t in range(T))
                    self.B[j, k] = numerator_B / denominator_B

            # Pi does not depend on c: the factor c[0] cancels in the ratio
            denominator_Pi = sum(alpha[0, j] * beta[0, j] for j in range(self.N))
            for i in range(self.N):
                self.Pi[i] = alpha[0, i] * beta[0, i] / denominator_Pi   # = gamma[0, i]

            # sum absolute differences so positive and negative changes do not cancel
            delta_A = np.abs(lambda_n[0] - self.A)
            delta_B = np.abs(lambda_n[1] - self.B)
            delta_Pi = np.abs(lambda_n[2] - self.Pi)
            delta_lambda = delta_A.sum() + delta_B.sum() + delta_Pi.sum()
            times += 1
            print(times)

        return self.A, self.B, self.Pi
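The division by c[t] in the code above follows from how the scaled quantities relate to the unscaled ones. The block below is a short derivation sketch under the scaling convention used in this code (the same factor c_t scales both alpha and beta at time t); the hat notation is introduced here for illustration.

```latex
% Scaled quantities in terms of the unscaled ones
\hat{\alpha}_t(i) = \Big(\prod_{s=1}^{t} c_s\Big)\,\alpha_t(i), \qquad
\hat{\beta}_t(i)  = \Big(\prod_{s=t}^{T} c_s\Big)\,\beta_t(i), \qquad
\prod_{s=1}^{T} c_s = \frac{1}{P(O \mid \lambda)}

% The factor c_t appears in both products, so one copy must be divided out
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O \mid \lambda)}
            = \frac{\hat{\alpha}_t(i)\,\hat{\beta}_t(i)}{c_t}

% In xi every c_s appears exactly once, so no correction is needed
\xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(O \mid \lambda)}
           = \hat{\alpha}_t(i)\,a_{ij}\,b_j(o_{t+1})\,\hat{\beta}_{t+1}(j)

% Re-estimation of the transition probabilities
\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}
```

This is why numerator_A contains no c[t] while denominator_A, numerator_B and denominator_B all divide by it.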
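For the second problem, we follow reference [2] and implement the modified Baum-Welch algorithm with scaling directly. For convenience this is a standalone function, written outside the HMM class: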
# For multiple sequences of observation symbols (with scaled alpha and beta).
# Independent function, defined outside class HMM.
def modified_Baum_Welch_with_scale(O_set):
    # O_set: list of observation sequences, each a sequence of integer symbol indices
    # initialization - lambda
    A = np.array([[0.2,0.2,0.3,0.3],[0.2,0.1,0.6,0.1],[0.3,0.4,0.1,0.2],[0.3,0.2,0.2,0.3]])
    B = np.array([[0.2,0.2,0.3,0.3],[0.2,0.1,0.6,0.1],[0.3,0.4,0.1,0.2],[0.3,0.2,0.2,0.3]])
    # B = np.array([[0.5,0.5],[0.3,0.7],[0.6,0.4],[0.8,0.2]])
    Pi = [0.25, 0.25, 0.25, 0.25]

    # computing alpha_set, beta_set, c_set
    O_length = len(O_set)
    # three separate lists: binding several names to one list object
    # would make them aliases that overwrite each other
    alpha_set = [None] * O_length
    beta_set = [None] * O_length
    c_set = [None] * O_length

    N = A.shape[0]
    M = B.shape[1]
    T = [len(O_set[i]) for i in range(O_length)]
    V = [k for k in range(M)]

    x = 1
    delta_lambda = x + 1
    times = 0
    while delta_lambda > x:   # iteration - lambda
        lambda_n = [A.copy(), B.copy()]   # copies, since A and B are updated in place

        for i in range(O_length):
            hmm = HMM(A, B, Pi, O_set[i])
            alpha_set[i], c_set[i] = hmm.forward_with_scale()
            beta_set[i] = hmm.backward_with_scale(c_set[i])

        for i in range(N):
            for j in range(N):
                numerator_A = 0.0
                denominator_A = 0.0
                for l in range(O_length):
                    numerator_A += sum(alpha_set[l][t, i] * A[i, j] * B[j, O_set[l][t+1]]
                                       * beta_set[l][t+1, j] for t in range(T[l]-1))
                    denominator_A += sum(alpha_set[l][t, i] * beta_set[l][t, i] / c_set[l][t]
                                         for t in range(T[l]-1))
                A[i, j] = numerator_A / denominator_A

        for j in range(N):
            for k in range(M):
                numerator_B = 0.0
                denominator_B = 0.0
                for l in range(O_length):
                    numerator_B += sum(alpha_set[l][t, j] * beta_set[l][t, j] / c_set[l][t]
                                       for t in range(T[l]) if O_set[l][t] == V[k])
                    denominator_B += sum(alpha_set[l][t, j] * beta_set[l][t, j] / c_set[l][t]
                                         for t in range(T[l]))
                B[j, k] = numerator_B / denominator_B

        # Pi is not re-estimated in this case;
        # in other cases a corresponding Pi would be computed

        # sum absolute differences so positive and negative changes do not cancel
        delta_A = np.abs(lambda_n[0] - A)
        delta_B = np.abs(lambda_n[1] - B)   # compare against B, not A
        delta_lambda = delta_A.sum() + delta_B.sum()
        times += 1
        print(times)

    return A, B
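Here we do not re-estimate pi. In real applications whether pi is needed depends on the situation, and sometimes it is not needed at all.
As the author of reference [2] puts it: By using scaling, we happily find that all Pl’s terms are cancelled out! The resulting format looks much cleaner!
That really is a happy finding.
This gives us a basic HMM. It is fairly elementary, but thanks to the strength of the model itself, these 200-odd lines of code are already quite capable. Readers are encouraged to try it; used properly, it should be enough for some simple prediction tasks.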
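For readers who want to experiment, one simple source of test data is to sample observation sequences from a known model and see how well training recovers it. The sketch below does only the sampling part; `sample_hmm` and the toy parameters are assumptions made for this example.

```python
import numpy as np


def sample_hmm(A, B, Pi, T, rng):
    """Sample a state sequence and an observation sequence of length T."""
    N, M = B.shape
    states = np.zeros(T, dtype=int)
    obs = np.zeros(T, dtype=int)
    states[0] = rng.choice(N, p=Pi)
    obs[0] = rng.choice(M, p=B[states[0]])
    for t in range(1, T):
        states[t] = rng.choice(N, p=A[states[t - 1]])   # transition step
        obs[t] = rng.choice(M, p=B[states[t]])          # emission step
    return states, obs


# Hypothetical "true" model: 3 states, 2 observation symbols.
A = np.array([[0.5, 0.2, 0.3], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5], [0.4, 0.6], [0.7, 0.3]])
Pi = np.array([0.2, 0.4, 0.4])

rng = np.random.default_rng(42)
# Several sequences of different lengths, as a list of integer index arrays.
O_set = [sample_hmm(A, B, Pi, T, rng)[1] for T in (20, 30, 25)]
print([len(O) for O in O_set])
```

A list like O_set has the shape the multi-sequence function expects, although that function hard-codes a 4-state initialization, so the sizes would need to be made consistent before combining the two.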
5. Extensions
To quote reference [1] directly:
V. Implementation issues for HMMs
The discussion in the previous two sections has primarily dealt with the theory of HMMs and several variations on the form of the model. In this section we deal with several practical implementation issues including scaling, multiple observation sequences, initial parameter estimates, missing data, and choice of model size and type. For some of these implementation issues we can prescribe exact analytical solutions; for other issues we can only provide some seat-of-the-pants experience gained from working with HMMs over the last several years.
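Readers can study the rest of reference [1] to go further, or search for more recent literature on the topic; the paper was, after all, published in 1989.
This article was written in some haste. If anything is unclear, comments are welcome.
Thanks for reading!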