A Detailed Explanation of the PyTorch Bi-LSTM + CRF Code

This post explains how the Bi-LSTM + CRF code from the official PyTorch tutorial computes $log(\sum_{\tilde{y} \in Y_{X}}e^{S(X,\tilde{y})})$.

The formula is $log(\sum_{\tilde{y} \in Y_{X}}e^{S(X,\tilde{y})})$ (referred to below as the logsumexp score).

Before reading this, it may help to first read this blog post:
Bi-LSTM-CRF for Sequence Labeling PENG
First, $S(X,\tilde{y})$ is defined as the score of the output tag sequence $\tilde{y}$ for the input sequence $X$:

$$S(X,\tilde{y})=\sum_{i=0}^{n-1}A_{y_i,y_{i+1}}+\sum_{i=0}^{n}P_{i,y_i}, \quad n \text{ is the sequence length}$$

where $A$ is the tag transition score matrix and $P$ is the Bi-LSTM output matrix, in which $P_{i,j}$ is the unnormalized score of mapping word $w_i$ to $tag_j$.
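
To make the definition concrete, here is a minimal sketch (all values below are toy numbers, not from the tutorial) that scores one candidate tag sequence against a transition matrix $A$ and an emission matrix $P$:

import torch

T, n = 3, 4                      # toy sizes: 3 tags, 4 words (assumptions)
A = torch.randn(T, T)            # A[i, j]: score of transitioning from tag i to tag j
P = torch.randn(n, T)            # P[i, j]: emission score of word w_i for tag j
y = [0, 2, 1, 1]                 # one candidate tag sequence

# S(X, y): transition scores along y plus emission scores at each position.
score = sum(A[y[i], y[i + 1]] for i in range(n - 1)) + sum(P[i, y[i]] for i in range(n))
print(score)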

Code walkthrough

The code provided on the PyTorch site does not compute the score of every path and then sum; instead it uses the idea of the forward algorithm: the logsumexp score up to position $i$ of the sequence can be obtained from the logsumexp score up to position $i-1$, plus the transition score and the Bi-LSTM emission score. Define

$$f(i,s_j)=log(\sum_{\tilde{y} \in Y_i,\, y_i=s_j} e^{S_i(X,\tilde{y},s_j)})$$

as the logsumexp score over all paths of length $i$ whose final state is $s_j$. We therefore want to construct a function

$$f(i,s_j)=F(f(i-1,s_1),f(i-1,s_2),\dots,f(i-1,s_T))=F(i-1), \quad T \text{ is the number of states}$$

Using the identity $log(\sum e^{log(\sum e^x)+y})=log(\sum\sum e^{x+y})$ (easy to verify by hand), we can derive the relation between $f(i,s_j)$ and $f(i-1,s_t)$:
$$\begin{aligned} &log\Big(\sum_{t=1}^{T}e^{f(i-1,s_t)+A_{s_t,s_j}+P_{i,j}}\Big) \\ &=log\Big(\sum_{t=1}^{T}e^{A_{s_t,s_j}+P_{i,j}}\cdot e^{f(i-1,s_t)}\Big) \\ &=log\Big(\sum_{t=1}^{T}e^{A_{s_t,s_j}+P_{i,j}}\cdot e^{log(\sum_{\tilde{y} \in Y_{i-1},\,y_{i-1}=s_t} e^{S_{i-1}(X,\tilde{y},s_t)})}\Big) \\ &=log\Big(\sum_{t=1}^{T}e^{A_{s_t,s_j}+P_{i,j}}\cdot \sum_{\tilde{y} \in Y_{i-1},\,y_{i-1}=s_t} e^{S_{i-1}(X,\tilde{y},s_t)}\Big) \\ &=log\Big(\sum_{t=1}^{T}\sum_{\tilde{y} \in Y_{i-1},\,y_{i-1}=s_t} e^{S_{i-1}(X,\tilde{y},s_t)+A_{s_t,s_j}+P_{i,j}}\Big) \\ &=log\Big(\sum_{t=1}^{T}\sum_{\tilde{y} \in Y_{i},\,y_{i-1}=s_t,\,y_i=s_j} e^{S_{i}(X,\tilde{y},s_j)}\Big) \\ &=log\Big(\sum_{\tilde{y} \in Y_{i},\,y_i=s_j} e^{S_{i}(X,\tilde{y},s_j)}\Big) \\ &=f(i,s_j) \end{aligned}$$
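
The identity used above is easy to check numerically; a quick sketch with arbitrary toy values:

import torch

x = torch.randn(5)
y = torch.randn(3)

# Left side: log(sum_t e^{log(sum_s e^{x_s}) + y_t})
lhs = torch.logsumexp(torch.logsumexp(x, dim=0) + y, dim=0)
# Right side: log(sum_s sum_t e^{x_s + y_t})
rhs = torch.logsumexp((x.unsqueeze(1) + y.unsqueeze(0)).flatten(), dim=0)
print(torch.allclose(lhs, rhs))  # True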

In the code, the logsumexp score of each state at the current step equals the logsumexp over (each previous state's logsumexp score + the transition score + the Bi-LSTM emission score):

$$logsumexp_{i,j}=log(\sum_{t=1}^{T}e^{logsumexp_{i-1,t}+A_{s_t,s_j}+P_{i,j}})$$

In terms of the tagging task, this formula reads:

the logsumexp score of word $w_i$ at position $i$ taking label $tag_j$ equals the log of the sum, over all $T$ tags $t$, of $e$ raised to (the logsumexp score of $tag_t$ at position $i-1$, plus the transition score from $tag_t$ to $tag_j$, plus the Bi-LSTM emission score of word $w_i$ for $tag_j$).
Finally, the logsumexp score of the whole sequence of length $n$ is $Final(n)=log(\sum_{t=1}^{T}e^{f(n,t)+A_{t,S_{stop}}})$, where $A_{t,S_{stop}}$ is the transition score from $tag_t$ to the stop symbol.
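
The whole recurrence can be checked against brute-force enumeration. Below is a minimal sketch with toy values; the START/STOP transitions are omitted for brevity, so it verifies only the interior recurrence plus the final logsumexp:

import itertools
import torch

T, n = 3, 4
A = torch.randn(T, T)            # A[t, j]: transition score from tag t to tag j
P = torch.randn(n, T)            # P[i, j]: emission score of word w_i for tag j

# Brute force: logsumexp over the scores of all T**n paths.
path_scores = []
for path in itertools.product(range(T), repeat=n):
    s = sum(P[i, path[i]] for i in range(n)) + sum(A[path[i], path[i + 1]] for i in range(n - 1))
    path_scores.append(s)
brute = torch.logsumexp(torch.stack(path_scores), dim=0)

# Forward recurrence: f[j] holds f(i, s_j) as i advances.
f = P[0].clone()                 # base case: emissions of the first word
for i in range(1, n):
    # entry [t, j] of the matrix below is f(i-1, s_t) + A[t, j] + P[i, j]
    f = torch.logsumexp(f.unsqueeze(1) + A + P[i].unsqueeze(0), dim=0)
forward = torch.logsumexp(f, dim=0)

print(torch.allclose(brute, forward))  # True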

Below is the code that computes $log(\sum_{\tilde{y} \in Y_{X}}e^{S(X,\tilde{y})})$:

def _forward_alg(self, feats):
    # Do the forward algorithm to compute the partition function
    init_alphas = torch.full((1, self.tagset_size), -10000.)
    # START_TAG has all of the score.
    init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

    # forward_var holds f(i-1, s_t) for every tag t; autograd tracks it,
    # so no explicit Variable wrapper is needed in current PyTorch
    forward_var = init_alphas

    # Iterate through the sentence
    for feat in feats:
        alphas_t = []  # The forward variables at this timestep
        for next_tag in range(self.tagset_size):
            # broadcast the emission score: it is the same regardless of
            # the previous tag
            emit_score = feat[next_tag].view(1, -1).expand(1, self.tagset_size)
            # the ith entry of trans_score is the score of transitioning to
            # next_tag from i
            trans_score = self.transitions[next_tag].view(1, -1)
            # The ith entry of next_tag_var is the value for the
            # edge (i -> next_tag) before we do log-sum-exp
            next_tag_var = forward_var + trans_score + emit_score
            # The forward variable for this tag is log-sum-exp of all the
            # scores.
            alphas_t.append(log_sum_exp(next_tag_var).view(1))
        forward_var = torch.cat(alphas_t).view(1, -1)
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    alpha = log_sum_exp(terminal_var)
    return alpha
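
The snippet relies on a log_sum_exp helper defined elsewhere in the tutorial, which applies the max-subtraction trick for numerical stability. A self-contained equivalent sketch (not the tutorial's exact code):

def log_sum_exp(vec):
    # vec: shape (1, tagset_size). Subtract the max before exponentiating
    # so that exp() never overflows, then add it back outside the log.
    max_score = vec.max()
    return max_score + torch.log(torch.sum(torch.exp(vec - max_score)))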

Below is the code that computes the score $S(X,y)$ of the target sentence. The parameter feats is the Bi-LSTM output, and the parameter tags is the sentence's gold tag sequence:

def _score_sentence(self, feats, tags):
    # Gives the score of a provided tag sequence
    score = torch.zeros(1)
    tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
    for i, feat in enumerate(feats):
        score = score + \
            self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
    score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
    return score
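
Note the indexing convention: self.transitions[j, t] is the score of moving to tag j from tag t, which is why the gold-path loop reads self.transitions[tags[i + 1], tags[i]]. A hypothetical usage sketch (the names model, sentence, and tags are assumptions, with model being the tutorial's BiLSTM_CRF instance):

feats = model._get_lstm_features(sentence)       # shape: (seq_len, tagset_size)
gold_score = model._score_sentence(feats, tags)  # S(X, y) as a 1-element tensor
print(gold_score)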

Below is the code that computes the loss function.

In Bi-LSTM + CRF, $p(y|X)=\frac{e^{S(X,y)}}{\sum_{\tilde{y} \in Y_{X}}e^{S(X,\tilde{y})}}$,
so the negative log-likelihood loss is $-log(p(y|X))=log(\sum_{\tilde{y} \in Y_{X}}e^{S(X,\tilde{y})})-S(X,y)$,
which corresponds to forward_score - gold_score in the code.

def neg_log_likelihood(self, sentence, tags):
    feats = self._get_lstm_features(sentence)       # Bi-LSTM emission scores P
    forward_score = self._forward_alg(feats)        # logsumexp over all paths
    gold_score = self._score_sentence(feats, tags)  # S(X, y) of the gold path
    return forward_score - gold_score
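
For completeness, a minimal training sketch; it assumes the tutorial's surrounding pieces (the BiLSTM_CRF class, prepare_sequence, word_to_ix, tag_to_ix, and training_data) are in scope:

import torch.optim as optim

optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
for epoch in range(300):
    for sentence, tags in training_data:
        model.zero_grad()                                  # clear accumulated gradients
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)
        loss = model.neg_log_likelihood(sentence_in, targets)
        loss.backward()                                    # backprop through forward_score - gold_score
        optimizer.step()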
