Before going further, it is worth reading this blog post first:
Bi-LSTM-CRF for Sequence Labeling PENG
First, define $S(X,\tilde{y})$, the score of an output tag sequence $\tilde{y}$ for the input sequence $X$:

$$S(X,\tilde{y})=\sum_{i=0}^{n-1}A_{y_i,y_{i+1}}+\sum_{i=0}^{n}P_{i,y_i},\quad n \text{ is the sequence length}$$
where $A$ is the tag transition score matrix and $P$ is the Bi-LSTM output matrix, in which $P_{i,j}$ is the unnormalized probability (score) of word $w_i$ being mapped to $tag_j$.
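To make the scoring function concrete, here is a minimal sketch that evaluates $S(X,y)$ for one candidate tag sequence. The tensors `A`, `P`, and `y` are made up for this illustration, and the START/STOP transitions are left out for brevity; note that `A[i, j]` here stores the score of moving *from* $tag_i$ *to* $tag_j$, matching the formula above (the tutorial's `self.transitions` is indexed the other way around):

```python
import torch

# Hypothetical example: 3 words, 5 tags (START/STOP left out for brevity).
n, T = 3, 5
A = torch.randn(T, T)            # A[i, j]: score of transitioning from tag_i to tag_j
P = torch.randn(n, T)            # P[i, j]: Bi-LSTM score of word w_i mapped to tag_j
y = torch.tensor([0, 2, 1])      # one candidate tag sequence

# S(X, y) = sum of transition scores along y + sum of emission scores along y
trans_score = A[y[:-1], y[1:]].sum()
emit_score = P[torch.arange(n), y].sum()
print(trans_score + emit_score)
```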
The code provided on the PyTorch website does not compute the score of every path and then sum them; instead it uses the idea of the forward algorithm: the logsumexp score up to position $i$ of the sequence can be computed from the logsumexp score up to position $i-1$, plus a transition score and a Bi-LSTM emission score. Define

$$f(i,s_j)=\log\Big(\sum_{\tilde{y} \in Y_i,\,y_i=s_j} e^{S_i(X,\tilde{y},s_j)}\Big)$$

as the logsumexp score over all paths of length $i$ whose final state is $s_j$. So we want to construct a function

$$f(i,s_j)=F(f(i-1,s_1),f(i-1,s_2),\dots,f(i-1,s_T))=F(i-1),\quad T \text{ is the number of states.}$$
We can use the identity $\log(\sum_y e^{\log(\sum_x e^x)+y})=\log(\sum_x\sum_y e^{x+y})$ (it holds, as you can verify by working it out by hand) to relate $f(i,s_j)$ to $f(i-1,s_t)$.
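The identity is also quick to check numerically; a throwaway sanity check with `torch.logsumexp` (illustration only, not part of the tutorial code):

```python
import torch

x = torch.randn(4)   # arbitrary x values
y = torch.randn(3)   # arbitrary y values

# left side: logsumexp over y of (logsumexp(x) + y)
lhs = torch.logsumexp(torch.logsumexp(x, dim=0) + y, dim=0)
# right side: logsumexp over all pairs x_i + y_j
rhs = torch.logsumexp((x.unsqueeze(1) + y.unsqueeze(0)).flatten(), dim=0)
print(torch.allclose(lhs, rhs))  # True
```

With this identity, the derivation is: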
$$\begin{aligned}
&\log\Big(\sum_{t=1}^{T}e^{f(i-1,s_t)+A_{s_t,s_j}+P_{i,j}}\Big) \\
&=\log\Big(\sum_{t=1}^{T}e^{A_{s_t,s_j}+P_{i,j}}\cdot e^{f(i-1,s_t)}\Big) \\
&=\log\Big(\sum_{t=1}^{T}e^{A_{s_t,s_j}+P_{i,j}}\cdot e^{\log\big(\sum_{\tilde{y} \in Y_{i-1},\,y_{i-1}=s_t} e^{S_{i-1}(X,\tilde{y},s_t)}\big)}\Big) \\
&=\log\Big(\sum_{t=1}^{T}e^{A_{s_t,s_j}+P_{i,j}}\cdot \sum_{\tilde{y} \in Y_{i-1},\,y_{i-1}=s_t} e^{S_{i-1}(X,\tilde{y},s_t)}\Big) \\
&=\log\Big(\sum_{t=1}^{T}\sum_{\tilde{y} \in Y_{i-1},\,y_{i-1}=s_t} e^{S_{i-1}(X,\tilde{y},s_t)+A_{s_t,s_j}+P_{i,j}}\Big) \\
&=\log\Big(\sum_{t=1}^{T}\sum_{\tilde{y} \in Y_{i},\,y_{i-1}=s_t,\,y_i=s_j} e^{S_{i}(X,\tilde{y},s_j)}\Big) \\
&=\log\Big(\sum_{\tilde{y} \in Y_{i},\,y_i=s_j} e^{S_{i}(X,\tilde{y},s_j)}\Big) \\
&=f(i,s_j)
\end{aligned}$$
That is (note the indices: the transition goes from $tag_t$ to $tag_j$, and the emission is for position $i$ and $tag_j$):

$$\mathrm{logsumexp}_{i,j}=\log\Big(\sum_{t=1}^{T}e^{\,\mathrm{logsumexp}_{i-1,t}+A_{t,j}+P_{i,j}}\Big)$$
In words: the logsumexp score of position $i$'s word $w_i$ taking label $tag_j$ equals the log of the sum, over all $t$, of $e$ raised to (the logsumexp score of position $i-1$ with label $tag_t$, plus the transition score from $tag_t$ to $tag_j$, plus the score in the Bi-LSTM output matrix of word $w_i$ mapped to $tag_j$).
Finally, the logsumexp score of a sequence of length $n$ is

$$Final(n)=\log\Big(\sum_{t=1}^{T}e^{f(n,t)+A_{t,S_{stop}}}\Big)$$

where $A_{t,S_{stop}}$ is the transition score from $tag_t$ to the stop symbol.
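Since the recursion only involves $f(i-1,\cdot)$, one row of $P$, and the matrix $A$, the inner loop over tags can be fully vectorized. A minimal sketch with `torch.logsumexp`, assuming a transitions matrix indexed like the tutorial's (`trans[j, t]` is the score of moving from tag $t$ to tag $j$); `forward_score`, `start_idx`, and `stop_idx` are names made up for this illustration:

```python
import torch

def forward_score(feats, trans, start_idx, stop_idx):
    """Vectorized forward algorithm.

    feats: (seq_len, T) emission scores P; trans: (T, T) with
    trans[j, t] = score of transitioning from tag t to tag j.
    """
    T = feats.size(1)
    f = torch.full((T,), -10000.)
    f[start_idx] = 0.  # all probability mass starts at START
    for feat in feats:
        # f_new[j] = logsumexp_t( f[t] + trans[j, t] ) + P[i, j]
        f = torch.logsumexp(f.unsqueeze(0) + trans, dim=1) + feat
    # add the transition to STOP and reduce over the final states
    return torch.logsumexp(f + trans[stop_idx], dim=0)
```

The tutorial's implementation below computes the same quantity with an explicit Python loop over `next_tag`: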
```python
def _forward_alg(self, feats):
    # Do the forward algorithm to compute the partition function
    init_alphas = torch.full((1, self.tagset_size), -10000.)
    # START_TAG has all of the score.
    init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

    # forward_var is part of the autograd graph, so we get
    # automatic backprop
    forward_var = init_alphas

    # Iterate through the sentence
    for feat in feats:
        alphas_t = []  # The forward variables at this timestep
        for next_tag in range(self.tagset_size):
            # broadcast the emission score: it is the same regardless of
            # the previous tag
            emit_score = feat[next_tag].view(1, -1).expand(1, self.tagset_size)
            # the ith entry of trans_score is the score of transitioning to
            # next_tag from i
            trans_score = self.transitions[next_tag].view(1, -1)
            # The ith entry of next_tag_var is the value for the
            # edge (i -> next_tag) before we do log-sum-exp
            next_tag_var = forward_var + trans_score + emit_score
            # The forward variable for this tag is log-sum-exp of all the
            # scores.
            alphas_t.append(log_sum_exp(next_tag_var).view(1))
        forward_var = torch.cat(alphas_t).view(1, -1)
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    alpha = log_sum_exp(terminal_var)
    return alpha
```
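The `log_sum_exp` and `argmax` helpers used above are defined earlier in the same tutorial; for reference, the numerically stable version (it shifts everything by the max before exponentiating so the exp cannot overflow, then adds the max back):

```python
def argmax(vec):
    # return the argmax as a python int
    _, idx = torch.max(vec, 1)
    return idx.item()

def log_sum_exp(vec):
    # compute log(sum(exp(vec))) stably by shifting by the max,
    # so that exp never overflows
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + \
        torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
```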
```python
def _score_sentence(self, feats, tags):
    # Gives the score S(X, y) of a provided tag sequence
    score = torch.zeros(1)
    # Prepend START_TAG so transitions line up with the formula above
    tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
    for i, feat in enumerate(feats):
        # transition tags[i] -> tags[i + 1] plus emission for tags[i + 1]
        score = score + \
            self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
    # finally, the transition from the last tag to STOP_TAG
    score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
    return score
```
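A good way to convince yourself that `_forward_alg` really computes $\log\big(\sum_{\tilde{y}} e^{S(X,\tilde{y})}\big)$ is to enumerate every path by brute force on a tiny example and compare. A sketch (the function name and arguments are made up for this check; `trans` plays the role of `self.transitions`):

```python
import itertools
import torch

def brute_force_partition(feats, trans, start_idx, stop_idx):
    # Enumerate every possible tag path, score it the way _score_sentence does,
    # and logsumexp the scores; should match _forward_alg up to float error.
    seq_len, T = feats.shape
    scores = []
    for path in itertools.product(range(T), repeat=seq_len):
        s = trans[path[0], start_idx] + feats[0, path[0]]
        for i in range(1, seq_len):
            s = s + trans[path[i], path[i - 1]] + feats[i, path[i]]
        s = s + trans[stop_idx, path[-1]]  # final transition to STOP
        scores.append(s)
    return torch.logsumexp(torch.stack(scores), dim=0)
```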
In Bi-LSTM + CRF,

$$p(y|X)=\frac{e^{S(X,y)}}{\sum_{\tilde{y} \in Y_{X}}e^{S(X,\tilde{y})}}$$

so

$$\log(p(y|X))=S(X,y)-\log\Big(\sum_{\tilde{y} \in Y_{X}}e^{S(X,\tilde{y})}\Big),$$

and the loss is the negative log-likelihood $-\log(p(y|X))=\log\big(\sum_{\tilde{y} \in Y_{X}}e^{S(X,\tilde{y})}\big)-S(X,y)$, which corresponds to forward_score - gold_score in the code:
```python
def neg_log_likelihood(self, sentence, tags):
    feats = self._get_lstm_features(sentence)       # Bi-LSTM emission scores P
    forward_score = self._forward_alg(feats)        # log sum over all paths
    gold_score = self._score_sentence(feats, tags)  # S(X, y) for the gold path
    return forward_score - gold_score
```
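During training, this negative log-likelihood is used directly as the loss. A minimal sketch of one training step, mirroring the tutorial's setup (the names `model`, `sentence_in`, and `targets` are assumptions here, not defined in this section):

```python
import torch.optim as optim

# Assumed setup: `model` is the BiLSTM_CRF module, `sentence_in` a tensor of
# word indices, `targets` a tensor of gold tag indices.
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.zero_grad()                                     # clear accumulated gradients
loss = model.neg_log_likelihood(sentence_in, targets)
loss.backward()                                       # backprop through forward_score - gold_score
optimizer.step()                                      # update LSTM weights and the transition matrix
```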