[NLP] Conditional Random Field (Ⅲ): The Estimation Problem

If you spot any writing, typesetting, or conceptual errors, please kindly point them out.

Supporting material for some of the derivations and definitions is scarce, so treat them as reference only. I add my own understanding throughout the discussion, so mistakes are hard to avoid; feedback and discussion are welcome.

Before reading this post, it is recommended to study first:
the Hidden Markov Model series
Maximum Entropy Markov Models

Due to the length limit, the material is split into five posts:
[NLP] Conditional Random Field (Ⅰ): Markov Random Fields
[NLP] Conditional Random Field (Ⅱ): Overview of Conditional Random Fields
[NLP] Conditional Random Field (Ⅲ): The Estimation Problem
[NLP] Conditional Random Field (Ⅳ): The Learning Problem
[NLP] Conditional Random Field (Ⅴ): The Decoding Problem

3.6. Estimation Problem

The estimation problem for conditional random fields is: given a conditional random field $P(Y \mid X)$, an input sequence $x$, and an output sequence $y$, compute the conditional probabilities $P(Y_i = y_i \mid x)$ and $P(Y_{i-1} = y_{i-1}, Y_i = y_i \mid x)$, as well as the corresponding expectations. For convenience, as with hidden Markov models, we introduce forward and backward vectors and compute these probabilities and expectations recursively.

3.6.1. Forward and Backward Vectors

Define the $N$-dimensional forward vector $\alpha_t$, $t = 0, 1, \dots, T+1$:

$$\alpha_t = \begin{pmatrix} \alpha_t(1) & \alpha_t(2) & \dots & \alpha_t(N) \end{pmatrix}^T$$

where $\alpha_t(i)$ is defined by

$$\alpha_0(i) = \begin{cases} 1, & i = \mathrm{start} \\ 0, & \text{otherwise} \end{cases}$$

$$\begin{aligned} \alpha_t(i) &= \sum_{y_0, y_1, \dots, y_{t-1}} D(y_0, \dots, y_{t-1}, y_t = q_i \mid x, w) \\ &= Z(x) \sum_{y_0, y_1, \dots, y_{t-1}} P(y_0, \dots, y_{t-1}, y_t = q_i \mid x, w), \qquad t = 1, 2, \dots, T+1 \end{aligned}$$

where $D$ denotes the unnormalized probability. The recursion is

$$\begin{aligned} \alpha_t(i) &= Z(x) \sum_{y_0, y_1, \dots, y_{t-1}} P(y_0, \dots, y_{t-1}, y_t = q_i \mid x, w) \\ &= Z(x) \sum_{j=1}^N \sum_{y_0, y_1, \dots, y_{t-2}} P(y_0, \dots, y_{t-1} = q_j, y_t = q_i \mid x, w) \\ &= Z(x) \sum_{j=1}^N \sum_{y_0, y_1, \dots, y_{t-2}} P(y_t = q_i \mid y_0, \dots, y_{t-1} = q_j, x, w)\, P(y_0, \dots, y_{t-1} = q_j \mid x, w) \\ &= Z(x) \sum_{j=1}^N P(y_t = q_i \mid y_{t-1} = q_j, x, w) \sum_{y_0, y_1, \dots, y_{t-2}} P(y_0, \dots, y_{t-1} = q_j \mid x, w) \\ &= \sum_{j=1}^N M_t(y_{t-1} = q_j, y_t = q_i)\, \alpha_{t-1}(j), \qquad t = 1, 2, \dots, T+1 \end{aligned} \tag{8}$$
α t ( i ) \alpha_t(i) αt(i) 构成向量 α t \alpha_t αt 得矩阵形式的递推公式
α t T = α t − 1 T M t (9) \alpha_t^T = \alpha_{t-1}^T M_t \tag{9} αtT=αt1TMt(9)
In the last step of the recursion, the second factor becomes $\alpha_{t-1}(j)$ by definition, and the first factor is converted into an unnormalized probability. For that first factor, viewed probabilistically, the joint probability (conditioned on the observation sequence $x$) factorizes, according to the conditional independencies expressed by the graph structure, into a product of conditional probabilities $P(y \mid x) = P(y_0) P(y_1 \mid y_0) P(y_2 \mid y_1) \dots P(y_{T+1} \mid y_T)$; this factorization corresponds to the forward recursion. Viewed from the energy (potential function) angle, the joint probability is the product of the potential functions over the cliques. In a CRF, each pair of adjacent states forms a clique, without ordering, so the joint probability can be written, up to normalization, as $P(y \mid x) \propto \psi_1(y_0, y_1) \psi_2(y_1, y_2) \dots \psi_{T+1}(y_T, y_{T+1})$, where $\psi_t(y_{t-1}, y_t)$ corresponds to $M_t(y_{t-1}, y_t)$. The conditional probabilities of the probabilistic view and the potential functions of the energy view are therefore consistent; note that a potential function is an unnormalized probability, so the two differ by the factor $Z(x)$.
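To make the recursion concrete, here is a minimal numeric sketch of the forward pass of Eq. (9). The state count $N$, sequence length $T$, and uniformly random potentials are arbitrary assumptions for illustration, not values from the text; $M_1$ is reduced to its start row, and the stop column $M_{T+1}(\cdot, \mathrm{stop})$ is taken to be all ones, so that $Z(x) = \alpha_T^T \mathbf{1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 3, 4                                     # hypothetical toy sizes

# Hypothetical random potentials standing in for the matrices M_t:
M1 = rng.uniform(0.5, 1.5, size=N)              # row M_1(start, y_1)
M = rng.uniform(0.5, 1.5, size=(T - 1, N, N))   # M[k] holds M_{k+2}, i.e. M_2 .. M_T

# Forward recursion, Eq. (9): alpha_t^T = alpha_{t-1}^T M_t
alpha = np.zeros((T + 1, N))
alpha[1] = M1                                   # alpha_1(i) = M_1(start, q_i)
for t in range(2, T + 1):
    alpha[t] = alpha[t - 1] @ M[t - 2]

# With M_{T+1}(., stop) = 1, the normalizer is Z(x) = alpha_T^T 1
Z = alpha[T].sum()
```

A brute-force sum of the unnormalized products over all $N^T$ paths reproduces the same $Z(x)$, which is a convenient sanity check for the recursion.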

Analogously, define the backward vector $\beta_t$, $t = 0, 1, \dots, T+1$:

$$\beta_t = \begin{pmatrix} \beta_t(1) & \beta_t(2) & \dots & \beta_t(N) \end{pmatrix}^T$$

where $\beta_t(i)$ is defined by

$$\beta_{T+1}(i) = \begin{cases} 1, & i = \mathrm{stop} \\ 0, & \text{otherwise} \end{cases}$$

$$\begin{aligned} \beta_t(i) &= \sum_{y_{t+1}, \dots, y_{T+1}} D(y_t = q_i, y_{t+1}, \dots, y_{T+1} \mid x, w) \\ &= Z(x) \sum_{y_{t+1}, \dots, y_{T+1}} P(y_t = q_i, y_{t+1}, \dots, y_{T+1} \mid x, w), \qquad t = 0, 1, \dots, T \end{aligned}$$

The recursion is

$$\begin{aligned} \beta_t(i) &= Z(x) \sum_{j=1}^N \sum_{y_{t+2}, \dots, y_{T+1}} P(y_t = q_i, y_{t+1} = q_j, \dots, y_{T+1} \mid x, w) \\ &= Z(x) \sum_{j=1}^N P(y_t = q_i \mid y_{t+1} = q_j, x, w) \sum_{y_{t+2}, \dots, y_{T+1}} P(y_{t+1} = q_j, \dots, y_{T+1} \mid x, w) \\ &= \sum_{j=1}^N M_{t+1}(y_t = q_i, y_{t+1} = q_j)\, \beta_{t+1}(j), \qquad t = 0, 1, \dots, T \end{aligned} \tag{10}$$
β t ( i ) \beta_t(i) βt(i) 构成向量 β t \beta_t βt 得矩阵形式的递推公式
β t = M t + 1 β t + 1 (11) \beta_t = M_{t+1}\beta_{t+1} \tag{11} βt=Mt+1βt+1(11)

The derivation of the backward vectors parallels that of the forward vectors. Since an undirected graph does not distinguish the direction of conditioning, the joint probability can also be written as $P(y \mid x) = P(y_{T+1}) P(y_T \mid y_{T+1}) \dots P(y_0 \mid y_1)$, and this factorization corresponds to the backward recursion. The two factorizations correspond to right-multiplying and left-multiplying by the matrix $M_t$ in the respective recursions.
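The backward pass of Eq. (11) can be sketched the same way, again with hypothetical random potentials; $M_{T+1}(q_i, \mathrm{stop})$ is taken as $1$, which makes $\beta_T$ a vector of ones. Folding the start row of $M_1$ into the backward result gives $\beta_0(\mathrm{start})$, which must equal the forward normalizer.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 3, 4                                     # hypothetical toy sizes
M1 = rng.uniform(0.5, 1.5, size=N)              # row M_1(start, y_1)
M = rng.uniform(0.5, 1.5, size=(T - 1, N, N))   # M_2 .. M_T

# Backward recursion, Eq. (11): beta_t = M_{t+1} beta_{t+1}
beta = np.zeros((T + 1, N))
beta[T] = 1.0                                   # beta_T(i) = M_{T+1}(q_i, stop) = 1
for t in range(T - 1, 0, -1):
    beta[t] = M[t - 1] @ beta[t + 1]

# beta_0(start) = sum_j M_1(start, q_j) beta_1(j) gives the same Z(x)
Z_backward = M1 @ beta[1]
```

Running the forward scan on the same potentials and comparing the two normalizers is a quick consistency test between Eqs. (9) and (11).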

3.6.2. Probability Computation

From the definitions of the forward and backward vectors, the conditional probability that the state sequence takes value $q_i$ at time $t$ follows directly:

$$P(y_t = q_i \mid x) = \frac{\alpha_t(i)\, \beta_t(i)}{Z(x)} \tag{12}$$
Using equations (8) and (10), this can be expanded as
$$\begin{aligned} P(y_t = q_i \mid x) &= \frac{\alpha_t(i)\, \beta_t(i)}{Z(x)} \\ &= \frac{1}{Z(x)} \left( \sum_{n=1}^N \sum_{m=1}^N \dots \sum_{k=1}^N \sum_{j=1}^N M_1(y_0 = q_n, y_1 = q_m)\, M_2(y_1 = q_m, y_2 = q_l) \dots M_{t-1}(y_{t-2} = q_k, y_{t-1} = q_j)\, M_t(y_{t-1} = q_j, y_t = q_i) \right) \\ &\quad\;\cdot \left( \sum_{j=1}^N \sum_{k=1}^N \dots \sum_{m=1}^N \sum_{n=1}^N M_{t+1}(y_t = q_i, y_{t+1} = q_j)\, M_{t+2}(y_{t+1} = q_j, y_{t+2} = q_k) \dots M_T(y_{T-1} = q_l, y_T = q_m)\, M_{T+1}(y_T = q_m, y_{T+1} = q_n) \right) \\ &= \frac{1}{Z(x)} \left( \sum_{y_0, \dots, y_{t-1}} M_1(y_0, y_1) \dots M_{t-1}(y_{t-2}, y_{t-1})\, M_t(y_{t-1}, y_t = q_i) \right) \cdot \left( \sum_{y_{t+1}, \dots, y_{T+1}} M_{t+1}(y_t = q_i, y_{t+1})\, M_{t+2}(y_{t+1}, y_{t+2}) \dots M_{T+1}(y_T, y_{T+1}) \right) \\ &= \frac{1}{Z(x)} \sum_{y_0, \dots, y_{t-1}, y_{t+1}, \dots, y_{T+1}} M_1(y_0, y_1) \dots M_t(y_{t-1}, y_t = q_i)\, M_{t+1}(y_t = q_i, y_{t+1}) \dots M_{T+1}(y_T, y_{T+1}) \end{aligned}$$
where each $M_t(y_{t-1}, y_t)$ is an element of the $N \times N$ matrix $M_t$.
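Under the same hypothetical toy setup (random potentials, assumed sizes), Eq. (12) can be checked numerically: the elementwise product $\alpha_t(i)\,\beta_t(i)$ divided by $Z(x)$ must form a proper distribution over the $N$ states at every position.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 3, 4                                     # hypothetical toy sizes
M1 = rng.uniform(0.5, 1.5, size=N)
M = rng.uniform(0.5, 1.5, size=(T - 1, N, N))

# Forward and backward sweeps, Eqs. (9) and (11)
alpha = np.zeros((T + 1, N)); alpha[1] = M1
for t in range(2, T + 1):
    alpha[t] = alpha[t - 1] @ M[t - 2]
beta = np.zeros((T + 1, N)); beta[T] = 1.0
for t in range(T - 1, 0, -1):
    beta[t] = M[t - 1] @ beta[t + 1]
Z = alpha[T].sum()

# Eq. (12): P(y_t = q_i | x) = alpha_t(i) beta_t(i) / Z(x), for t = 1..T
marginal = alpha[1:] * beta[1:] / Z
```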

The conditional probability that the states at times $t-1$ and $t$ are $q_i$ and $q_j$ is:

$$P(y_{t-1} = q_i, y_t = q_j \mid x) = \frac{\alpha_{t-1}(i)\, M_t(y_{t-1} = q_i, y_t = q_j)\, \beta_t(j)}{Z(x)} \tag{13}$$
Likewise, using equations (8) and (10), this expands to

$$P(y_{t-1} = q_i, y_t = q_j \mid x) = \frac{1}{Z(x)} \sum_{y_0, \dots, y_{t-2}, y_{t+1}, \dots, y_{T+1}} M_1(y_0, y_1) \dots M_{t-1}(y_{t-2}, y_{t-1} = q_i)\, M_t(y_{t-1} = q_i, y_t = q_j)\, M_{t+1}(y_t = q_j, y_{t+1}) \dots M_{T+1}(y_T, y_{T+1})$$

Here,

$$Z(x) = \alpha_T^T \mathbf{1} = \mathbf{1}^T \beta_1$$

where $\mathbf{1}$ is the $N$-dimensional column vector whose entries are all $1$.
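Eq. (13) admits the same kind of numeric check (hypothetical random potentials again): each pairwise table must sum to one, and marginalizing out $y_{t-1}$ must recover the single-state marginal of Eq. (12).

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 3, 4                                     # hypothetical toy sizes
M1 = rng.uniform(0.5, 1.5, size=N)
M = rng.uniform(0.5, 1.5, size=(T - 1, N, N))

alpha = np.zeros((T + 1, N)); alpha[1] = M1
for t in range(2, T + 1):
    alpha[t] = alpha[t - 1] @ M[t - 2]
beta = np.zeros((T + 1, N)); beta[T] = 1.0
for t in range(T - 1, 0, -1):
    beta[t] = M[t - 1] @ beta[t + 1]
Z = alpha[T].sum()

# Eq. (13): P(y_{t-1}=q_i, y_t=q_j | x)
#         = alpha_{t-1}(i) M_t(q_i, q_j) beta_t(j) / Z(x), for t = 2..T
pair = np.zeros((T - 1, N, N))
for t in range(2, T + 1):
    pair[t - 2] = np.outer(alpha[t - 1], beta[t]) * M[t - 2] / Z
```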

3.6.3. Expectation Computation

Using the forward and backward vectors, one can compute the expectations of the feature functions with respect to the joint distribution $P(X, Y)$ and the conditional distribution $P(Y \mid X)$.

The expectation of a feature function $f_k(x, y)$ with respect to the conditional distribution $P(Y \mid X)$ is

$$\begin{aligned} E_{P(Y \mid X)}[f_k] &= \sum_y P(y \mid x)\, f_k(y, x) \\ &= \sum_{t=1}^{T+1} \sum_{y_{t-1}, y_t} P(y_{t-1}, y_t \mid x)\, f_k(y_{t-1}, y_t, x, t) \\ &= \sum_{t=1}^{T+1} \sum_{i, j} f_k(y_{t-1} = q_i, y_t = q_j, x, t)\, \frac{\alpha_{t-1}(i)\, M_t(y_{t-1} = q_i, y_t = q_j)\, \beta_t(j)}{\alpha_T^T \mathbf{1}}, \qquad k = 1, 2, \dots, K \end{aligned} \tag{14}$$
Let the empirical distribution be $\tilde P(X)$. The expectation of $f_k$ with respect to the joint distribution $P(X, Y)$ is

$$\begin{aligned} E_{P(x, y)}[f_k] &= \sum_{x, y} P(x, y) \sum_{t=1}^{T+1} f_k(y_{t-1}, y_t, x, t) \\ &= \sum_x \tilde P(x) \sum_y P(y \mid x) \sum_{t=1}^{T+1} f_k(y_{t-1}, y_t, x, t) \\ &= \sum_x \tilde P(x) \sum_{t=1}^{T+1} \sum_{i, j} f_k(y_{t-1} = q_i, y_t = q_j, x, t)\, \frac{\alpha_{t-1}(i)\, M_t(y_{t-1} = q_i, y_t = q_j \mid x)\, \beta_t(j)}{\alpha_T^T \mathbf{1}}, \qquad k = 1, 2, \dots, K \end{aligned} \tag{15}$$
( 12 ) (12) (12) 和式 ( 13 ) (13) (13) 是特征函数数学期望的一般计算公式。对于转移特征 t i ( y t − 1 , y t , x , t ) t_i(y_{t-1}, y_t, x, t) ti(yt1,yt,x,t) i = 1 , 2 , … , K 1 i=1,2,\dots, K_1 i=1,2,,K1,可以将式中的 f k f_k fk 换成 t i t_i ti;对于状态特征,可以将式中的 f k f_k fk 换成 s j s_j sj,表示 s j ( y t , x , t ) s_j(y_t, x, t) sj(yt,x,t) j = 1 , 2 , … , K 2 j=1,2,\dots, K_2 j=1,2,,K2

With equations $(12) \sim (15)$, for a given observation sequence $x$ and state sequence $y$, a single forward scan computes all $\alpha_t$ and $Z(x)$, and a single backward scan computes all $\beta_t$; from these, all of the probabilities and feature expectations above can be evaluated.
