Everything below is my own understanding; some formulas may require scrolling the page to the right to view in full.
Excerpted from Zhou Zhihua's *Machine Learning* [1]:
Probabilistic graphical models are a class of probabilistic models that use a graph to express the relationships among variables. Most commonly, a node represents one or a group of random variables, and the edges between nodes represent probabilistic dependencies between those random variables.
Depending on the nature of the edges, probabilistic graphical models fall into two classes: graphs that use directed edges to express dependencies between random variables, called Bayesian networks or directed graphical models, and graphs that use undirected edges to express correlations between variables, called undirected graphical models or Markov networks.
Excerpted from Zhou Zhihua's *Machine Learning* [1]:
The HMM is a directed graphical model mainly used for modeling time-series data, with wide applications in speech recognition and natural language processing. The variables of an HMM fall into two groups. The first group is the state variables $\{y_1, y_2, \dots, y_n\}$, where $y_i$ denotes the system state at time $i$; the state variables are generally hidden. The second group is the observed variables $\{x_1, x_2, \dots, x_n\}$, where $x_i$ denotes the observation at time $i$; the observed variables are generally known. Observations may be discrete or continuous, while the state variables are discrete.
The probabilistic graph structure of an HMM is as follows:
The arrows express the dependencies between variables. At time $t$, the value of the observed variable $x_t$ depends only on the value of the state variable $y_t$ and is unrelated to any other state or observed variable; the value of the state variable $y_t$ depends only on the state variable $y_{t-1}$ at time $t-1$ and has no direct dependence on the earlier $t-2$ variables.
From the figure above, we obtain the following conclusion:
Suppose the observation sequence takes the values $(o_1, o_2, \dots, o_T)$ and the state variables take the values $(i_1, i_2, \dots, i_T)$; then the probability of this observation sequence and state sequence occurring together is:
$$P(o_1,o_2,\dots,o_T,i_1,i_2,\dots,i_T)=P(i_1)P(o_1\mid i_1)P(i_2\mid i_1)P(o_2\mid i_2)\cdots P(i_T\mid i_{T-1})P(o_T\mid i_T)\quad(\text{Eq. }1.0)$$
Starting from Eq. (1.0), let us think about what Eq. (1.0) requires.
In all of the discussion that follows, we assume the state variables take values in $\{s_1, s_2, \dots, s_N\}$ and the observed variables take values in $\{o_1, o_2, \dots, o_M\}$.
Eq. (1.0) requires three classes of parameters: the initial state probabilities $\pi_i=P(i_1=s_i)$, the state transition probabilities $a_{ij}=P(i_{t+1}=s_j\mid i_t=s_i)$, and the output observation probabilities $b_j(o_k)=P(o_k\mid i_t=s_j)$. Having defined these three classes of parameters, we express the latter two in matrix form: the state transition probability matrix $A=[a_{ij}]_{N\times N}$ and the output observation probability matrix $B=[b_j(o_k)]_{N\times M}$, alongside the initial state probability vector $\pi=(\pi_1,\dots,\pi_N)$.
Finally, we define the representation of an HMM to be $\lambda=[A,B,\pi]$.
To recap, the thread of this section is: starting from the definition of the HMM probabilistic graph, we obtained Eq. (1.0); from the unknown parts of Eq. (1.0), we defined the initial state probability vector, the state transition probability matrix, and the output observation probability matrix, which finally fixes the representation of the HMM.
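To make the representation concrete, here is a minimal sketch with a hypothetical toy model (the numbers in `pi`, `A`, `B` are made up purely for illustration) that evaluates Eq. (1.0) directly; the later sketches reuse these arrays:

```python
import numpy as np

# Hypothetical toy model lambda = [A, B, pi]: N = 2 states, M = 2 symbols.
pi = np.array([0.6, 0.4])           # pi[i]   = P(i_1 = s_i)
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])         # A[i, j] = P(i_{t+1} = s_j | i_t = s_i)
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])         # B[i, k] = P(o_t = k | i_t = s_i)

def joint_prob(states, obs):
    """Eq. (1.0): P(o_1,...,o_T, i_1,...,i_T | lambda), chained left to right."""
    p = pi[states[0]] * B[states[0], obs[0]]          # P(i_1) P(o_1 | i_1)
    for t in range(1, len(obs)):
        p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return p

print(joint_prob([0, 0, 1], [0, 1, 1]))               # one specific path
```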
The first two of the following problems concern the purposes for which we train an HMM.
Problem 1: given the model $\lambda=[A,B,\pi]$ and an observation sequence $O=(o_1,o_2,\dots,o_T)$, compute the probability of $O$ appearing under $\lambda$, i.e. $P(O\mid\lambda)$.
Problem 2: given the model $\lambda=[A,B,\pi]$ and an observation sequence $O=(o_1,o_2,\dots,o_T)$, compute the most probable state sequence; writing the state sequence as $I=(i_1,i_2,\dots,i_T)$, this is the state sequence that maximizes $P(I\mid O,\lambda)$.
Problem 3: given an observation sequence $O$, estimate the parameters of the model $\lambda$ so that $P(O\mid\lambda)$ is maximized under that model, i.e. how to train the model from observation sequences.
So, based on the model $\lambda$, can we achieve the two goals above?
First, let us be clear about the forward probability and the backward probability.
Forward probability: given the model $\lambda$, the probability that the observation sequence up to time $t$ is $(o_1,o_2,\dots,o_t)$ and that the state variable $i_t$ at time $t$ takes the value $s_i$, i.e. $\alpha_t(i)=P(o_1,o_2,\dots,o_t,i_t=s_i\mid\lambda)$.
Backward probability: given the model $\lambda$ and that the state variable at time $t$ is $i_t=s_i$, the probability that the observations from time $t+1$ to time $T$ are $(o_{t+1},o_{t+2},\dots,o_T)$, i.e. $\beta_t(i)=P(o_{t+1},o_{t+2},\dots,o_T\mid i_t=s_i,\lambda)$.
The forward probability is $\alpha_t(s_i)=P(o_1,o_2,\dots,o_t,i_t=s_i\mid\lambda)$: only the state at time $t$ is required to be $s_i$, while the states before time $t$ are arbitrary, so
$$\alpha_t(s_i)=P(o_1,\dots,o_t,i_t=s_i,i_{t-1}=s_1\mid\lambda)+P(o_1,\dots,o_t,i_t=s_i,i_{t-1}=s_2\mid\lambda)+\cdots+P(o_1,\dots,o_t,i_t=s_i,i_{t-1}=s_N\mid\lambda)$$
Since the value of the state at time $t$ depends only on the state at time $t-1$, and the observation at time $t$ depends only on the state at time $t$, we have
$$P(o_1,\dots,o_t,i_t=s_i,i_{t-1}=s_j\mid\lambda)=P(o_1,\dots,o_{t-1},i_{t-1}=s_j\mid\lambda)\,P(i_t=s_i\mid i_{t-1}=s_j,\lambda)\,b_i(o_t)=\alpha_{t-1}(s_j)\,a_{ji}\,b_i(o_t)$$
Therefore
$$\alpha_t(s_i)=\sum_{j=1}^{N}\alpha_{t-1}(s_j)\,a_{ji}\,b_i(o_t)$$
It follows that
$$P(O\mid\lambda)=P(O,i_T=s_1\mid\lambda)+P(O,i_T=s_2\mid\lambda)+\cdots+P(O,i_T=s_N\mid\lambda)=\sum_{i=1}^{N}\alpha_T(s_i)$$
$$\alpha_T(s_i)=\sum_{j=1}^{N}\alpha_{T-1}(s_j)\,a_{ji}\,b_i(o_T)$$
$$\alpha_{T-1}(s_i)=\sum_{j=1}^{N}\alpha_{T-2}(s_j)\,a_{ji}\,b_i(o_{T-1})$$
Recursing downward, we eventually reach the forward probability at time 1; from the HMM probabilistic graph, the forward probability at time 1 is:
$$\alpha_1(s_i)=P(o_1,i_1=s_i\mid\lambda)=P(i_1=s_i\mid\lambda)\,P(o_1\mid i_1=s_i,\lambda)=\pi_i\,b_i(o_1)$$
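Putting the initialization $\alpha_1(s_i)=\pi_i\,b_i(o_1)$ and the recursion together, here is a sketch of the forward pass, reusing the hypothetical `pi`, `A`, `B` from above:

```python
def forward(obs):
    """alpha[t - 1, i] holds alpha_t(s_i) = P(o_1..o_t, i_t = s_i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                 # alpha_1(s_i) = pi_i b_i(o_1)
    for t in range(1, T):
        # alpha_t(s_i) = sum_j alpha_{t-1}(s_j) a_{ji} b_i(o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

obs = [0, 1, 1]                    # a hypothetical observation sequence
alpha = forward(obs)
print(alpha[-1].sum())             # P(O | lambda) = sum_i alpha_T(s_i)
```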
The backward probability is
$$\beta_t(i)=P(o_{t+1},o_{t+2},\dots,o_T\mid i_t=s_i,\lambda)$$
i.e. given that the state at time $t$ is $s_i$, the probability of observing $(o_{t+1},\dots,o_T)$, with the states from time $t+1$ to $T$ arbitrary. We have
$$\beta_t(s_i)=P(o_{t+1},\dots,o_T\mid i_t=s_i,\lambda)=P(o_{t+1},\dots,o_T,i_{t+1}=s_1\mid i_t=s_i,\lambda)+P(o_{t+1},\dots,o_T,i_{t+1}=s_2\mid i_t=s_i,\lambda)+\cdots+P(o_{t+1},\dots,o_T,i_{t+1}=s_N\mid i_t=s_i,\lambda)$$
Starting from the formula for conditional probability, we have:
$$P(AB\mid C,\lambda)=\frac{P(ABC\mid\lambda)}{P(C\mid\lambda)}=\frac{P(ABC\mid\lambda)}{P(BC\mid\lambda)}\cdot\frac{P(BC\mid\lambda)}{P(C\mid\lambda)}=P(A\mid B,C,\lambda)\,P(B\mid C,\lambda)\quad(\text{Eq. }1.1)$$
For $P(o_{t+1},o_{t+2},\dots,o_T,i_{t+1}=s_1\mid i_t=s_i,\lambda)$, let $A$ be $o_{t+1},o_{t+2},\dots,o_T$, let $B$ be $i_{t+1}=s_1$, and let $C$ be $i_t=s_i$; then:
$$P(o_{t+1},\dots,o_T,i_{t+1}=s_1\mid i_t=s_i,\lambda)=P(o_{t+1},\dots,o_T\mid i_t=s_i,i_{t+1}=s_1,\lambda)\,P(i_{t+1}=s_1\mid i_t=s_i,\lambda)$$
From the HMM probabilistic graph, once the state at time $t+1$ is given, the value of the state at time $t$ cannot affect the observations from time $t+1$ onward, so
$$P(o_{t+1},\dots,o_T\mid i_t=s_i,i_{t+1}=s_1,\lambda)=P(o_{t+1},\dots,o_T\mid i_{t+1}=s_1,\lambda)$$
Now let $B$ be $o_{t+1}$, let $A$ be $o_{t+2},\dots,o_T$, and let $C$ be $i_{t+1}=s_1$; applying Eq. (1.1) in the same way, we get
$$P(o_{t+1},o_{t+2},\dots,o_T\mid i_{t+1}=s_1,\lambda)=P(o_{t+2},o_{t+3},\dots,o_T\mid i_{t+1}=s_1,\lambda)\,P(o_{t+1}\mid i_{t+1}=s_1,\lambda)=\beta_{t+1}(s_1)\,b_1(o_{t+1})$$
And $P(i_{t+1}=s_1\mid i_t=s_i,\lambda)=a_{i1}$,
so $P(o_{t+1},o_{t+2},\dots,o_T,i_{t+1}=s_1\mid i_t=s_i,\lambda)=a_{i1}\,b_1(o_{t+1})\,\beta_{t+1}(s_1)$.
Finally, summing over the state at time $t+1$, we obtain
$$\beta_t(s_i)=\sum_{j=1}^{N}a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(s_j)$$
Using the backward probability, we have
$$P(O\mid\lambda)=\sum_{i=1}^{N}\pi_i\,b_i(o_1)\,\beta_1(s_i)$$
$$\beta_1(s_i)=\sum_{j=1}^{N}a_{ij}\,b_j(o_2)\,\beta_2(s_j)$$
$$\beta_2(s_i)=\sum_{j=1}^{N}a_{ij}\,b_j(o_3)\,\beta_3(s_j)$$
$$\cdots$$
$$\beta_T(s_i)=1$$
The backward probability at time $T$ is special: there are no observations left after time $T$ to account for, so the backward probability at time $T$ is 1, i.e. $\beta_T(s_i)=1,\ 1\le i\le N$.
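A matching sketch of the backward pass, with $\beta_T(s_i)=1$ as the starting point:

```python
def backward(obs):
    """beta[t - 1, i] holds beta_t(s_i) = P(o_{t+1}..o_T | i_t = s_i, lambda)."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                       # beta_T(s_i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(s_i) = sum_j a_{ij} b_j(o_{t+1}) beta_{t+1}(s_j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

beta = backward(obs)
print((pi * B[:, obs[0]] * beta[0]).sum())       # P(O | lambda) again
```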
Combining the two, for any time $t$:
$$P(O\mid\lambda)=P(o_1,\dots,o_t,i_t=s_1\mid\lambda)\,P(o_{t+1},\dots,o_T\mid i_t=s_1,\lambda)+P(o_1,\dots,o_t,i_t=s_2\mid\lambda)\,P(o_{t+1},\dots,o_T\mid i_t=s_2,\lambda)+\cdots+P(o_1,\dots,o_t,i_t=s_N\mid\lambda)\,P(o_{t+1},\dots,o_T\mid i_t=s_N,\lambda)=\sum_{i=1}^{N}\alpha_t(i)\,\beta_t(i)$$
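With the `forward` and `backward` sketches above, this identity is easy to check numerically: summing $\alpha_t(i)\,\beta_t(i)$ over states gives the same value at every $t$.

```python
for t in range(len(obs)):
    print((alpha[t] * beta[t]).sum())   # identical for every t: P(O | lambda)
```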
Using the representation of the HMM together with Eq. (1.0), we can depict the HMM as the following figure:
The horizontal axis denotes time; each point in the figure denotes one possible value of the state variable (e.g. every point in the first row denotes the value $s_1$). The weight on an edge is the corresponding state transition probability multiplied by the corresponding output observation probability. Given the observation sequence, all the weights in the figure are determined, and we have
$$\max_I P(I\mid O,\lambda)=\max_I\frac{P(O,I\mid\lambda)}{P(O\mid\lambda)}\quad(\text{Eq. }1.2)$$
Since $P(O\mid\lambda)$ is a constant with respect to $I$, Eq. (1.2) reduces to
$$\max_I P(I\mid O,\lambda)=\max_I P(O,I\mid\lambda)=\max_I\ \pi_{i_1}b_{i_1}(o_1)\prod_{t=2}^{T}a_{i_{t-1}i_t}\,b_{i_t}(o_t)$$
That is, we must find a path in the figure that maximizes the product of the weights along it. Problem 2 can therefore be solved with the Viterbi algorithm, which finds the optimal path by dynamic programming, based on the following property: if the optimal path runs from $i_1$ to $i_T$ and passes through an intermediate node $i_t$, then its sub-paths from $i_1$ to $i_t$ and from $i_t$ to $i_T$ are each optimal. Letting $\delta_t(s_i)$ denote the maximum weight product over all paths that reach state $s_i$ at time $t$, we have:
$$\delta_t(s_i)=\max_{1\le j\le N}\ \delta_{t-1}(s_j)\,a_{ji}\,b_i(o_t)\quad(\text{Eq. }1.4)$$
Following Eq. (1.4), we also keep a 2-D array to record the optimal path, with rows for states and columns for times; each entry stores the previous state on the best path into that state at that time.
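A sketch of Viterbi along these lines, where `delta` is the table of Eq. (1.4) and `psi` is the back-pointer array described above (names hypothetical):

```python
def viterbi(obs):
    """Most probable state sequence arg max_I P(I | O, lambda), via Eq. (1.4)."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))              # best path weight into state i at time t
    psi = np.zeros((T, N), dtype=int)     # psi[t, i]: previous state on that path
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A   # scores[j, i] = delta_{t-1}(s_j) a_{ji}
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]      # best final state, then trace back
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

print(viterbi(obs))
```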
If we have not only the observation sequences but also the corresponding state values, then by the law of large numbers, when there is plenty of training data we can use frequencies to approximate the probabilities.
Let $A_{ij}$ be the number of times in the training data that the state changes from $s_i$ to $s_j$, $1\le i,j\le N$; then $a_{ij}=\frac{A_{ij}}{\sum_{k=1}^{N}A_{ik}}$
Let $A_i$ be the number of times the state at time 1 takes the value $s_i$; then $\pi_i=\frac{A_i}{\sum_{k=1}^{N}A_k}$
Let $B_{ji}$ be the number of times the state is $s_j$ while the observation is $o_i$, $1\le j\le N,\ 1\le i\le M$; then $b_j(o_i)=\frac{B_{ji}}{\sum_{k=1}^{M}B_{jk}}$
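A sketch of these counting estimates, assuming fully labeled training data given as lists of (state, observation) index pairs (the data format here is hypothetical):

```python
def count_estimate(sequences, N, M):
    """Frequency estimates of pi, A, B from labeled (state, obs) sequences."""
    pi_c, A_c, B_c = np.zeros(N), np.zeros((N, N)), np.zeros((N, M))
    for seq in sequences:
        states = [s for s, _ in seq]
        pi_c[states[0]] += 1                       # time-1 state counts
        for s, o in seq:
            B_c[s, o] += 1                         # emission counts
        for s_prev, s_next in zip(states, states[1:]):
            A_c[s_prev, s_next] += 1               # transition counts
    return (pi_c / pi_c.sum(),
            A_c / A_c.sum(axis=1, keepdims=True),
            B_c / B_c.sum(axis=1, keepdims=True))

# e.g. two labeled toy sequences:
print(count_estimate([[(0, 0), (0, 1), (1, 1)], [(1, 1), (0, 0)]], N=2, M=2))
```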
Let me briefly introduce the EM algorithm: the EM algorithm is mathematically proven to converge.
In what follows, we use $i_t$ to denote the value of the state variable at time $t$.
$$Q(\lambda,\hat\lambda)=\sum_I \ln P(O,I\mid\lambda)\,P(I\mid O,\hat\lambda)=\sum_I \frac{\ln P(O,I\mid\lambda)\,P(O,I\mid\hat\lambda)}{P(O\mid\hat\lambda)}$$
where $\lambda$ is the model we are solving for and $\hat\lambda$ is the current estimate of the model.
Since $P(O\mid\hat\lambda)$ is a constant, it can be ignored when maximizing the expectation, so we need only maximize
$$\sum_I \ln P(O,I\mid\lambda)\,P(O,I\mid\hat\lambda)\quad(\text{Eq. }2.0)$$
From the HMM probabilistic graph model we know:
$$P(O,I\mid\lambda)=\pi_{i_1}b_{i_1}(o_1)\,a_{i_1i_2}b_{i_2}(o_2)\cdots a_{i_{T-1}i_T}b_{i_T}(o_T)\quad(\text{Eq. }2.1)$$
Substituting Eq. (2.1) into Eq. (2.0) gives
$$E=\sum_I\Big[\ln\pi_{i_1}+\sum_{t=1}^{T-1}\ln a_{i_ti_{t+1}}+\sum_{t=1}^{T}\ln b_{i_t}(o_t)\Big]P(O,I\mid\hat\lambda)=\sum_I\ln\pi_{i_1}\,P(O,I\mid\hat\lambda)+\sum_I\sum_{t=1}^{T-1}\ln a_{i_ti_{t+1}}\,P(O,I\mid\hat\lambda)+\sum_I\sum_{t=1}^{T}\ln b_{i_t}(o_t)\,P(O,I\mid\hat\lambda)\quad(\text{Eq. }2.2)$$
Because of the constraints, we use the method of Lagrange multipliers to update the parameters above; let $\theta$ denote the Lagrange multiplier.
1. Updating $\pi_i$: noting that $\sum_{i=1}^N\pi_i=1$, we form
$$E+\theta\Big(\sum_{i=1}^N\pi_i-1\Big)\quad(\text{Eq. }2.3)$$
Differentiating Eq. (2.3) with respect to $\pi_i$ gives
$$\frac{P(O,i_1=s_i\mid\hat\lambda)}{\pi_i}+\theta=0\quad(\text{Eq. }2.4)$$
We only need the value of $\theta$ to obtain the updated $\pi_i$. Differentiating Eq. (2.3) with respect to $\pi_1,\pi_2,\dots,\pi_N$ in turn gives
$$\frac{P(O,i_1=s_1\mid\hat\lambda)}{\theta}=-\pi_1,\qquad \frac{P(O,i_1=s_2\mid\hat\lambda)}{\theta}=-\pi_2,\qquad\dots,\qquad \frac{P(O,i_1=s_N\mid\hat\lambda)}{\theta}=-\pi_N$$
Adding these equations and using $\sum_{i=1}^N\pi_i=1$ (note that $\sum_{i=1}^N P(O,i_1=s_i\mid\hat\lambda)=P(O\mid\hat\lambda)$) gives $P(O\mid\hat\lambda)=-\theta$. Substituting this into Eq. (2.4) yields
$$\pi_i=\frac{P(O,i_1=s_i\mid\hat\lambda)}{P(O\mid\hat\lambda)}\quad(\text{Eq. }2.5)$$
2. Updating the state transition probabilities $a_{ij}$.
The following identity regroups the sum over all state sequences $I$ according to the values taken at times $t$ and $t+1$: every sequence with $i_t=s_i$ and $i_{t+1}=s_j$ contributes a term $\ln a_{ij}$, and collecting those contributions produces the marginal $P(O,i_t=s_i,i_{t+1}=s_j\mid\hat\lambda)$:
$$\sum_I\Big(\sum_{t=1}^{T-1}\ln a_{i_ti_{t+1}}\Big)P(O,I\mid\hat\lambda)=\sum_{i=1}^{N}\sum_{j=1}^{N}\sum_{t=1}^{T-1}\ln a_{ij}\,P(O,i_t=s_i,i_{t+1}=s_j\mid\hat\lambda)$$
Noting that $\sum_{j=1}^N a_{ij}=1$, we form
$$E-\theta\Big(\sum_{j=1}^N a_{ij}-1\Big)\quad(\text{Eq. }2.6)$$
Differentiating Eq. (2.6) with respect to $a_{ij}$ gives
$$\frac{\sum_{t=1}^{T-1}P(O,i_t=s_i,i_{t+1}=s_j\mid\hat\lambda)}{a_{ij}}=\theta\quad(\text{Eq. }2.7)$$
As before, differentiating with respect to each of $a_{i1},a_{i2},\dots,a_{iN}$ in turn gives
$$\frac{\sum_{t=1}^{T-1}P(O,i_t=s_i,i_{t+1}=s_1\mid\hat\lambda)}{\theta}=a_{i1},\qquad \frac{\sum_{t=1}^{T-1}P(O,i_t=s_i,i_{t+1}=s_2\mid\hat\lambda)}{\theta}=a_{i2},\qquad\dots,\qquad \frac{\sum_{t=1}^{T-1}P(O,i_t=s_i,i_{t+1}=s_N\mid\hat\lambda)}{\theta}=a_{iN}$$
Adding these up (summing out $i_{t+1}$ gives $\sum_{j=1}^N P(O,i_t=s_i,i_{t+1}=s_j\mid\hat\lambda)=P(O,i_t=s_i\mid\hat\lambda)$) yields
$$\theta=\sum_{t=1}^{T-1}P(O,i_t=s_i\mid\hat\lambda)$$
Substituting this into Eq. (2.7) gives
$$a_{ij}=\frac{\sum_{t=1}^{T-1}P(O,i_t=s_i,i_{t+1}=s_j\mid\hat\lambda)}{\sum_{t=1}^{T-1}P(O,i_t=s_i\mid\hat\lambda)}\quad(\text{Eq. }2.8)$$
3. Updating the output observation probabilities.
Suppose the observations take values in $\{m_1,m_2,\dots,m_M\}$ and the observation sequence is $(o_1,o_2,\dots,o_T)$. Several times may share the same observation value; for example, if the observations at times 1 and 9 are both $m_3$, then $o_1=o_9=m_3$. We update $b_j(m_k)$. As with the transition probabilities, the sum over $I$ regroups by the state value at each time:
$$\sum_I\Big[\sum_{t=1}^{T}\ln b_{i_t}(o_t)\Big]P(O,I\mid\hat\lambda)=\sum_{j=1}^{N}\sum_{t=1}^{T}\ln b_j(o_t)\,P(O,i_t=s_j\mid\hat\lambda)$$
We further split the inner sum according to the value of $o_t$. Let $\{o_k=m_h\}$ denote the set of all times $k$ with $o_k=m_h$; then
$$\sum_{t=1}^{T}\ln b_j(o_t)\,P(O,i_t=s_j\mid\hat\lambda)=\ln b_j(m_1)\sum_{k\in\{o_k=m_1\}}P(O,i_k=s_j\mid\hat\lambda)+\ln b_j(m_2)\sum_{k\in\{o_k=m_2\}}P(O,i_k=s_j\mid\hat\lambda)+\cdots+\ln b_j(m_M)\sum_{k\in\{o_k=m_M\}}P(O,i_k=s_j\mid\hat\lambda)\quad(\text{Eq. }2.9)$$
Noting that $\sum_{h=1}^M b_j(m_h)=1$, we form
$$E-\theta\Big(\sum_{h=1}^M b_j(m_h)-1\Big)\quad(\text{Eq. }3.0)$$
Differentiating with respect to $b_j(m_h)$ (using Eq. 2.9) gives
$$\frac{\sum_{k\in\{o_k=m_h\}}P(O,i_k=s_j\mid\hat\lambda)}{b_j(m_h)}-\theta=0\quad(\text{Eq. }3.1)$$
Differentiating Eq. (3.0) with respect to each $b_j(m_h)$, $1\le h\le M$, in turn gives
$$\frac{\sum_{k\in\{o_k=m_1\}}P(O,i_k=s_j\mid\hat\lambda)}{\theta}=b_j(m_1),\qquad \frac{\sum_{k\in\{o_k=m_2\}}P(O,i_k=s_j\mid\hat\lambda)}{\theta}=b_j(m_2),\qquad\dots,\qquad \frac{\sum_{k\in\{o_k=m_M\}}P(O,i_k=s_j\mid\hat\lambda)}{\theta}=b_j(m_M)$$
Adding all of these, and noting that at each time the state value pairs with exactly one observation value (so the $M$ groups partition the times $1,\dots,T$), we have
$$\sum_{k\in\{o_k=m_1\}}P(O,i_k=s_j\mid\hat\lambda)+\sum_{k\in\{o_k=m_2\}}P(O,i_k=s_j\mid\hat\lambda)+\cdots+\sum_{k\in\{o_k=m_M\}}P(O,i_k=s_j\mid\hat\lambda)=\sum_{t=1}^{T}P(O,i_t=s_j\mid\hat\lambda)$$
Therefore
$$\sum_{t=1}^{T}P(O,i_t=s_j\mid\hat\lambda)=\theta$$
Substituting this into Eq. (3.1) gives
$$b_j(m_h)=\frac{\sum_{k\in\{o_k=m_h\}}P(O,i_k=s_j\mid\hat\lambda)}{\sum_{t=1}^{T}P(O,i_t=s_j\mid\hat\lambda)}$$
Using the HMM probabilistic graph model, we know
$$P(O,i_t=s_j\mid\hat\lambda)=\alpha_t(j)\,\beta_t(j)$$
$$P(O,i_t=s_i,i_{t+1}=s_j\mid\hat\lambda)=\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)$$
$$P(O\mid\hat\lambda)=\sum_{i=1}^{N}\alpha_t(i)\,\beta_t(i)$$
With these, every update above can be computed from the forward and backward probabilities.
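Putting the three update formulas and the $\alpha/\beta$ identities together, here is a sketch of one EM (Baum-Welch) step, reusing `forward` and `backward` from above (a sketch under the single-sequence setting of this derivation, not a production implementation):

```python
def baum_welch_step(obs):
    """One EM re-estimation of (pi, A, B): Eq. (2.5), Eq. (2.8), and the
    b_j(m_h) update, with joint probabilities expressed via alpha/beta."""
    alpha, beta = forward(obs), backward(obs)
    pO = alpha[-1].sum()                 # P(O | lambda_hat)
    gamma = alpha * beta / pO            # gamma[t, j] = P(i_t = s_j | O, lambda_hat)
    # xi[t, i, j] = P(i_t = s_i, i_{t+1} = s_j | O, lambda_hat)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / pO
    new_pi = gamma[0]                                          # Eq. (2.5)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # Eq. (2.8)
    new_B = np.zeros_like(B)
    for t, o in enumerate(obs):          # group the times with o_t = m_h
        new_B[:, o] += gamma[t]
    new_B /= gamma.sum(axis=0)[:, None]  # b_j(m_h) update
    return new_pi, new_A, new_B

pi, A, B = baum_welch_step(obs)          # iterate until convergence
```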
HMMs are used for modeling time-series data and are widely applied in natural language processing and speech recognition; my understanding is that in those fields they can be used, for example, for Chinese word segmentation.
[1] Zhou Zhihua, *Machine Learning* (《机器学习》).
[2] Hidden Markov model (隐马尔可夫模型).