[Conditional Random Fields] Linear-Chain CRF: Principles and Implementation (Part 1)

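Until recently I had only heard of CRFs, which is not really acceptable for an NLP PhD student. A recent project required a CRF, so I worked through the theory and code of the linear-chain CRF, following [1].

For the theory of the linear-chain CRF, I recommend reading [2] first. If the book's notation feels heavy or obscure, you can read this post alongside it. On the implementation side, the official PyTorch tutorial provides an implementation [1], but it only sketches how the code maps to the theory, so readers who go straight from the book to that code will still find it hard. This post is for readers who already have a basic grasp of linear-chain CRF theory: it explains every detail of the implementation and walks through building a linear-chain CRF from start to finish.

Part 1 of this post is based on the code in [1]. It walks through that code alongside the theory, explaining each detail in the order in which the implementation is built up. Because the CRF in [1] handles only a single example and not a batch, Part 2 implements a CRF layer that processes a batch of examples, explains how to handle the mask when sequences in a batch have different lengths, and provides the corresponding code.

The symbols in this post and the variable names in the code follow [2] and [1] as closely as possible.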

Notations

| Notation | Meaning |
| --- | --- |
| $x_i$ | the input sequence $x_1, \dots, x_n$ |
| $y_i$ | the tag sequence $y_1, \dots, y_n$, $y_i \in \lbrace 0, 1, \dots, \lvert T \rvert-1 \rbrace$ |
| $h_i$ | the hidden representation of each $x_i$, $h_i \in \mathbb{R}^{\lvert T \rvert}$ |
| $T$ | the tag set, including $start$ and $stop$ |
| $start, stop$ | two additional special tags, $start, stop \in \lbrace 0, 1, \dots, \lvert T \rvert-1 \rbrace$ |
| $P$ | the (only) trainable parameter within the CRF layer, $P \in \mathbb{R}^{\lvert T \rvert \times \lvert T \rvert}$ |
self.transitions = nn.Parameter(torch.randn(self.tagset_size, self.tagset_size))
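
To make the indexing convention of $P$ concrete, here is a minimal plain-Python sketch (no PyTorch). It assumes the tag vocabulary used later in this post and the convention that `P[to][frm]` scores the move from tag `frm` to tag `to`. The `NEG` constraints on the special tags are an assumption added for illustration and are not shown in the line above.

```python
import random

# Tag vocabulary as used later in this post; "<START>" and "<STOP>" are the special tags.
tag_to_ix = {"B": 0, "I": 1, "O": 2, "<START>": 3, "<STOP>": 4}
num_tags = len(tag_to_ix)
start, stop = tag_to_ix["<START>"], tag_to_ix["<STOP>"]

# P[to][frm] is the score of moving from tag `frm` to tag `to`.
# Entries are drawn from a standard normal, mirroring torch.randn.
rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
P = [[rng.gauss(0.0, 1.0) for _ in range(num_tags)] for _ in range(num_tags)]

# Assumption: forbid impossible transitions with a large negative score.
NEG = -10000.0
for t in range(num_tags):
    P[start][t] = NEG  # no tag may be followed by <START>
    P[t][stop] = NEG   # <STOP> may not be followed by any tag

# Reading a transition score: B -> I is P[I][B], not P[B][I].
score_b_to_i = P[tag_to_ix["I"]][tag_to_ix["B"]]
```

The row index is the destination tag and the column index is the source tag, which matches $P[y_i][y_{i-1}]$ in the formulas of this post.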

Basics

Linear-chain CRF: compute the conditional probability $P(y|x)$ of a tag sequence $y$ given an input sequence of tokens $x$.

Estimate $P(y|x)$: for each $y$, sum the values of the feature functions to obtain $Score(x,y)$, in which $Score(x,y) \propto \log P(y|x) = \log P(y_1, \dots, y_n|x)$, and $P(y|x) = \frac{\exp(Score(x,y))}{\sum_{y'}\exp(Score(x,y'))}$. Here $\propto$ is used loosely: $Score(x,y)$ equals $\log P(y|x)$ up to an additive term that does not depend on $y$.

  • The feature functions for each position $1 \leq i \leq n$
    • Emit score $h_i[y_i]$
      • Captures the semantic feature of this time step
    • Transition score $P[y_i][y_{i-1}]$
      • Captures the local feature within adjacent tags, regardless of the absolute position
      • When $i=1$, the transition score is calculated specially, see _score_sentence
  • The feature function for the final transition
    • Transition score $P[y_{n+1}][y_n]$
    • It is calculated specially, see _score_sentence
  • $Score(x, y) = \sum_{i=1}^{n}\left(h_i[y_i]+P[y_i][y_{i-1}]\right) + P[y_{n+1}][y_n]$

Train: optimize the model by minimizing $-\log P(y|x)$
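
The definitions above can be checked by brute force on a tiny example. The sketch below is plain Python with made-up numbers: it enumerates every tag sequence, computes $Score(x,y)$, and normalizes with a softmax over sequences. The toy values of `P` and `h`, the tag ids, and the use of `-10000` for forbidden transitions are all assumptions for illustration.

```python
import itertools
import math

NEG = -10000.0           # stand-in for -inf
START, STOP = 3, 4       # tag ids: B=0, I=1, O=2, <START>=3, <STOP>=4

# Toy transition scores P[to][frm]; all numbers are made up.
P = [
    # frm: B     I     O   START STOP
    [0.2,  0.1,  0.5,  0.6, NEG],  # to B
    [0.8,  0.4, -2.0, -2.0, NEG],  # to I
    [0.1,  0.3,  0.6,  0.4, NEG],  # to O
    [NEG,  NEG,  NEG,  NEG, NEG],  # to <START>
    [0.0,  0.2,  0.3,  0.0, NEG],  # to <STOP>
]
# Toy emission scores h[i][tag] for n = 3 tokens.
h = [
    [0.2, 1.0, 0.3, 0.0, 0.0],
    [0.5, 0.9, 0.4, 0.0, 0.0],
    [0.3, 0.2, 0.6, 0.0, 0.0],
]


def score(h, P, y):
    """Score(x, y): emissions plus transitions, with y_0 = start and y_{n+1} = stop."""
    prev, total = START, 0.0
    for h_i, y_i in zip(h, y):
        total += h_i[y_i] + P[y_i][prev]
        prev = y_i
    return total + P[STOP][prev]


# Softmax over all |T|^n tag sequences (only feasible for toy sizes).
all_y = list(itertools.product(range(len(P)), repeat=len(h)))
scores = [score(h, P, y) for y in all_y]
max_s = max(scores)
log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores))
prob = {y: math.exp(s - log_z) for y, s in zip(all_y, scores)}
best_y = max(prob, key=prob.get)
```

Enumeration costs $\lvert T \rvert^n$ evaluations, which is why the forward algorithm and Viterbi decoding in the later parts are needed for real inputs.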

Part 1: _score_sentence

Calculate the score for a specific sample, i.e. estimate $\log P(y_0=start, y_1, \dots, y_n, y_{n+1}=stop|x)$ given $y$ and $x$

  • Add $y_0=start, y_{n+1}=stop$ to the tag sequence, where $start, stop \in \lbrace 0, 1, \dots, \lvert T \rvert-1 \rbrace$

    • START_TAG = "<START>"
      STOP_TAG = "<STOP>"
      tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}
      
      # stored on the model
      self.tag_to_ix = tag_to_ix
      self.tagset_size = len(tag_to_ix)
      
      # in _score_sentence: prepend the start tag so that tags[0] is y_0
      tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
      
    • After update: tags.shape = (seq_len+1,)

  • Score = sum of value of feature functions

    • For position $1 \leq i \leq n$
      • Emit score $h_i[y_i]$
      • Transition score $P[y_i][y_{i-1}]$
    • For position $i = n+1$
      • Transition score $P[stop][y_n]$ (there is no emit score at this position)
    • $Score(x, y) = \sum_{i=1}^{n}\left(h_i[y_i]+P[y_i][y_{i-1}]\right) + P[stop][y_n]$
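score = torch.zeros(1)
for i, feat in enumerate(feats):
    # feat is the emission vector of this step; tags[i] is the previous tag
    # and tags[i + 1] the current tag, because the start tag was prepended
    score = score + \
        self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
# final transition from the last tag into the stop tag
score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]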
  • $Score(x, y) \propto \log P(y|x)$
    • After softmax on $y$, $Score(x, y)$ becomes $P(y|x)$
  • Returns: $Score(x, y) \propto \log P(y|x)$
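
As a sanity check of the loop above, here is a plain-Python sketch that mirrors its indexing on a toy example, with nested lists in place of tensors. The function name `score_sentence`, the toy values of `P` and `h`, and the `-10000` entries are assumptions for illustration.

```python
NEG = -10000.0  # stand-in for -inf
tag_to_ix = {"B": 0, "I": 1, "O": 2, "<START>": 3, "<STOP>": 4}

# Toy transition scores P[to][frm]; all numbers are made up.
P = [
    # frm: B     I     O   START STOP
    [0.2,  0.1,  0.5,  0.6, NEG],  # to B
    [0.8,  0.4, -2.0, -2.0, NEG],  # to I
    [0.1,  0.3,  0.6,  0.4, NEG],  # to O
    [NEG,  NEG,  NEG,  NEG, NEG],  # to <START>
    [0.0,  0.2,  0.3,  0.0, NEG],  # to <STOP>
]
# Toy emission scores h[i][tag] for n = 3 tokens.
h = [
    [0.2, 1.0, 0.3, 0.0, 0.0],
    [0.5, 0.9, 0.4, 0.0, 0.0],
    [0.3, 0.2, 0.6, 0.0, 0.0],
]


def score_sentence(feats, tags, P, tag_to_ix):
    """Sum emission and transition scores along one given tag sequence."""
    tags = [tag_to_ix["<START>"]] + list(tags)  # length becomes seq_len + 1
    score = 0.0
    for i, feat in enumerate(feats):
        # transition (previous tag -> current tag) plus emission of the current tag
        score += P[tags[i + 1]][tags[i]] + feat[tags[i + 1]]
    # final transition from the last tag into <STOP>
    return score + P[tag_to_ix["<STOP>"]][tags[-1]]


gold_bio = score_sentence(h, [0, 1, 2], P, tag_to_ix)  # B, I, O: about 3.7
gold_iio = score_sentence(h, [1, 1, 2], P, tag_to_ix)  # I, I, O: about 1.5
```

The second sequence starts with I, so it pays the toy penalty `P[I][<START>] = -2.0`, which is why it scores lower even though its first emission is larger.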

Part 2: _forward_alg

Calculate the total score summed over all possible $y$ given $x$, i.e. estimate $\log P(y_0=start, y_{n+1}=stop|x)$

  • The per-step score matrix $M_i \in \mathbb{R}^{\lvert T \rvert \times \lvert T \rvert}$

    • For position $1 \leq i \leq n$, $M_i[y_i][y_{i-1}] = h_i[y_i] + P[y_i][y_{i-1}]$
    • For position $i = n+1$, $M_i[y_i][y_{i-1}] = P[y_i][y_{i-1}]$
    • $Score(x, y) = \sum_{i=1}^{n+1} M_i[y_i][y_{i-1}]$
  • A step-by-step derivation

    • Consider the estimation of $P(y_0|y_0=start, x)$

      • $\alpha_0[y_0] = \begin{cases}0, & y_0=start \\ -\infty, & \text{otherwise}\end{cases}$

      • $\alpha_0 \propto \log P(y_0|y_0=start, x)$

      • $\alpha_0 \in \mathbb{R}^{\lvert T \rvert}$: init_alphas, the initial forward_var

      • init_alphas = torch.full((1, self.tagset_size), -10000.)
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.
        forward_var = init_alphas
        
    • Consider the estimation of $P(y_1|y_0=start, x)$

      • $\forall y_1, y_0 \in T$, $Score(y_1, y_0) = \alpha_0[y_0] + M_1[y_1][y_0] \propto \log P(y_1,y_0|y_0=start, x)$
      • Log-sum-exp over $y_0$ (the log of the softmax denominator): $\alpha_1[y_1] = Score(y_1) = \log \sum_{y_0}\exp(Score(y_1, y_0))$
    • Generalize to each time step $1 \leq i \leq n$

      • Please refer to the for loop in the code: the part commented "Iterate through the sentence"

      • $\forall y_{i}, y_{i-1} \in T$, $Score(y_i, y_{i-1}) = \alpha_{i-1}[y_{i-1}] + M_{i}[y_{i}][y_{i-1}] \propto \log P(y_{i},y_{i-1}|y_0=start, x)$

        • for next_tag in range(self.tagset_size):
              emit_score = feat[next_tag].view(
                  1, -1).expand(1, self.tagset_size)
              trans_score = self.transitions[next_tag].view(1, -1)
              next_tag_var = forward_var + trans_score + emit_score
          
      • Log-sum-exp over $y_{i-1}$: $\alpha_{i}[y_i] = Score(y_i) = \log \sum_{y_{i-1}}\exp(Score(y_{i}, y_{i-1}))$

        • alphas_t = []  
          for next_tag in range(self.tagset_size):
              # next_tag_var is computed for this next_tag as in the snippet above
              alphas_t.append(log_sum_exp(next_tag_var).view(1))
          forward_var = torch.cat(alphas_t).view(1, -1)
          
    • For time step $i = n+1$

      • We only calculate $Score(stop, y_n) = Score(y_n) + M_{n+1}[stop][y_n]$

      • $Score(stop, y_n) \propto \log P(y_{n+1}=stop,y_n|y_0=start, x)$

      • $\log P(y_{n+1}=stop|y_0=start, x) = \log\sum_{y_n} \exp(Score(stop, y_n))$

      • terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        
    • $\alpha = \log P(y_{n+1}=stop|y_0=start, x)$

      • since we always start from $y_0=start$, $\alpha = \log P(y_{n+1}=stop|x) = \log P(y_0=start, y_{n+1}=stop|x)$
    • Returns: $\alpha = \log P(y_0=start, y_{n+1}=stop|x)$
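
The recursion above can be sketched in plain Python and compared against brute-force enumeration on a toy example. The helper names `log_sum_exp`, `forward_alg` and `score`, the toy values of `P` and `h`, and the `-10000` stand-in for $-\infty$ are assumptions for illustration. The sketch processes a single sequence with lists rather than tensors.

```python
import itertools
import math

NEG = -10000.0           # stand-in for -inf, as in init_alphas
START, STOP = 3, 4       # tag ids: B=0, I=1, O=2, <START>=3, <STOP>=4

# Toy transition scores P[to][frm] and emission scores h[i][tag]; numbers are made up.
P = [
    [0.2,  0.1,  0.5,  0.6, NEG],  # to B
    [0.8,  0.4, -2.0, -2.0, NEG],  # to I
    [0.1,  0.3,  0.6,  0.4, NEG],  # to O
    [NEG,  NEG,  NEG,  NEG, NEG],  # to <START>
    [0.0,  0.2,  0.3,  0.0, NEG],  # to <STOP>
]
h = [
    [0.2, 1.0, 0.3, 0.0, 0.0],
    [0.5, 0.9, 0.4, 0.0, 0.0],
    [0.3, 0.2, 0.6, 0.0, 0.0],
]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_alg(h, P):
    """Return alpha, the log of the summed exp-scores over all tag sequences."""
    num_tags = len(P)
    alpha = [NEG] * num_tags
    alpha[START] = 0.0  # alpha_0: all mass on <START>
    for h_i in h:
        # alpha_i[cur] = log-sum-exp over prev of alpha_{i-1}[prev] + M_i[cur][prev]
        alpha = [
            log_sum_exp([alpha[prev] + P[cur][prev] + h_i[cur]
                         for prev in range(num_tags)])
            for cur in range(num_tags)
        ]
    # time step n + 1: only the transition into <STOP>
    return log_sum_exp([alpha[prev] + P[STOP][prev] for prev in range(num_tags)])


def score(h, P, y):
    """Score(x, y) of one tag sequence, used only for the brute-force check."""
    prev, total = START, 0.0
    for h_i, y_i in zip(h, y):
        total += h_i[y_i] + P[y_i][prev]
        prev = y_i
    return total + P[STOP][prev]


alpha = forward_alg(h, P)
brute = log_sum_exp([score(h, P, y)
                     for y in itertools.product(range(len(P)), repeat=len(h))])
```

The forward pass needs on the order of $n\lvert T \rvert^2$ additions, while the brute-force check grows as $\lvert T \rvert^n$; on this toy input the two values agree up to floating-point error.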

Part 3: neg_log_likelihood

The training objective for the CRF model: minimize the negative log likelihood of $P(y|x)$

  • Forward score $\alpha$

    • $\alpha = \log P(y_0=start, y_{n+1}=stop|x)$
  • Gold score $Score(x, y)$

    • $Score(x, y) = \log P(y_0=start, y_1, \dots, y_n, y_{n+1}=stop|x)$
  • The loss

    • $loss = \alpha-Score(x, y) = -\log P(y_1,\dots,y_n|y_0=start, y_{n+1}=stop, x)$
  • Since all sequences begin with $start$ and end with $stop$, $loss = -\log P(y_1,\dots,y_n|x)$

  • forward_score = self._forward_alg(feats)
    gold_score = self._score_sentence(feats, tags)
    return forward_score - gold_score
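
A quick way to see that this difference is a proper negative log likelihood is to check that $\exp(-loss)$ sums to 1 over all real tag sequences. The plain-Python sketch below does this on a toy example. The function names, the toy values of `P` and `h`, and the `-10000` entries are assumptions for illustration.

```python
import itertools
import math

NEG = -10000.0           # stand-in for -inf
START, STOP = 3, 4       # tag ids: B=0, I=1, O=2, <START>=3, <STOP>=4

# Toy transition scores P[to][frm] and emission scores h[i][tag]; numbers are made up.
P = [
    [0.2,  0.1,  0.5,  0.6, NEG],  # to B
    [0.8,  0.4, -2.0, -2.0, NEG],  # to I
    [0.1,  0.3,  0.6,  0.4, NEG],  # to O
    [NEG,  NEG,  NEG,  NEG, NEG],  # to <START>
    [0.0,  0.2,  0.3,  0.0, NEG],  # to <STOP>
]
h = [
    [0.2, 1.0, 0.3, 0.0, 0.0],
    [0.5, 0.9, 0.4, 0.0, 0.0],
    [0.3, 0.2, 0.6, 0.0, 0.0],
]


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score(h, P, y):
    """Gold score: emissions plus transitions along the given tags."""
    prev, total = START, 0.0
    for h_i, y_i in zip(h, y):
        total += h_i[y_i] + P[y_i][prev]
        prev = y_i
    return total + P[STOP][prev]


def forward_alg(h, P):
    """Forward score: log of the summed exp-scores over all tag sequences."""
    n_tags = len(P)
    alpha = [0.0 if t == START else NEG for t in range(n_tags)]
    for h_i in h:
        alpha = [log_sum_exp([alpha[p] + P[c][p] + h_i[c] for p in range(n_tags)])
                 for c in range(n_tags)]
    return log_sum_exp([alpha[p] + P[STOP][p] for p in range(n_tags)])


def neg_log_likelihood(h, P, y):
    """loss = forward score - gold score = -log P(y | x)."""
    return forward_alg(h, P) - score(h, P, y)


loss_gold = neg_log_likelihood(h, P, (0, 1, 2))   # B, I, O
loss_other = neg_log_likelihood(h, P, (1, 1, 2))  # I, I, O
# Probabilities of all sequences over the real tags B, I, O.
total_prob = sum(math.exp(-neg_log_likelihood(h, P, y))
                 for y in itertools.product(range(3), repeat=len(h)))
```

Sequences that pass through the special tags in the middle carry a score near `-10000`, so their probability underflows to zero and the 27 sequences over B, I, O account for all of the mass.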
    

Part 4: _viterbi_decode

Make predictions based on the CRF model: $y^*=\underset{y}{\arg\max}\ Score(x, y)$

  • Notations

    • $Score(x,y,i)=\sum_{k=1}^{i}\left(h_k[y_k]+P[y_k][y_{k-1}]\right)$, the score of the first $i$ positions
    • $\alpha_i = \underset{y_1,\dots,y_{i-1}}{\max}\ Score(x, y, i) \in \mathbb{R}^{\lvert T \rvert}$, where $\alpha_{i}[y_j]= \max\ Score(x, y, i)\ |_{y_i=y_j}$
  • The initial score $\alpha_0$

    • $\alpha_{0}[y_j]=Score(x, y, 0)\ |_{y_0=y_j}$

    • $\alpha_0[y_0] = \begin{cases}0, & y_0=start \\ -\infty, & \text{otherwise}\end{cases}$

    • $\alpha_0 \in \mathbb{R}^{\lvert T \rvert}$: init_vvars, the initial forward_var

  • Go forward, for position $1 \leq i \leq n$

    • Please refer to the for loop in code

    • Algorithm

      • notice that $Score(x, y, i) = Score(x, y, i-1) + h_i[y_i]+P[y_i][y_{i-1}]$

      • $\alpha_i'[y_i]=\underset{y_{i-1}}{\max}\ \left(\alpha_{i-1}[y_{i-1}]+P[y_i][y_{i-1}]\right)$

        • next_tag_var = forward_var + self.transitions[next_tag]
          best_tag_id = argmax(next_tag_var)
          
      • $\alpha_i[y_i] = \alpha_i'[y_i]+h_i[y_i]$

        • for next_tag in range(self.tagset_size):
              viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
          forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
          
    • Comments

      • Correctness follows from the recursion above: the best partial path ending in $y_i$ must extend the best partial path ending in some $y_{i-1}$
      • In the process, the information of the current best path is recorded in
        • each time step: bptrs_t$[y_i]=\underset{y_{i-1}}{\arg\max}\ \left(\alpha_{i-1}[y_{i-1}]+P[y_i][y_{i-1}]\right)$, which is appended to backpointers
  • For time step $n+1$

    • $Score(x, y^*)=\underset{y_{n}}{\max}\ \left(\alpha_{n}[y_{n}]+P[stop][y_{n}]\right)$

    • terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
      best_tag_id = argmax(terminal_var)
      path_score = terminal_var[0][best_tag_id]
      
  • Returns: the path score; the best path

    • the best path is restored by tracing back backpointers
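
The whole decoding procedure, including the backtrace, can be sketched in plain Python on a toy example. The function name `viterbi_decode`, the toy values of `P` and `h`, and the `-10000` stand-in for $-\infty$ are assumptions for illustration. The sketch handles one sequence with lists rather than tensors.

```python
NEG = -10000.0           # stand-in for -inf, as in init_vvars
START, STOP = 3, 4       # tag ids: B=0, I=1, O=2, <START>=3, <STOP>=4

# Toy transition scores P[to][frm] and emission scores h[i][tag]; numbers are made up.
P = [
    [0.2,  0.1,  0.5,  0.6, NEG],  # to B
    [0.8,  0.4, -2.0, -2.0, NEG],  # to I
    [0.1,  0.3,  0.6,  0.4, NEG],  # to O
    [NEG,  NEG,  NEG,  NEG, NEG],  # to <START>
    [0.0,  0.2,  0.3,  0.0, NEG],  # to <STOP>
]
h = [
    [0.2, 1.0, 0.3, 0.0, 0.0],
    [0.5, 0.9, 0.4, 0.0, 0.0],
    [0.3, 0.2, 0.6, 0.0, 0.0],
]


def viterbi_decode(h, P):
    """Return (best path score, best tag path) for one sequence."""
    num_tags = len(P)
    alpha = [NEG] * num_tags
    alpha[START] = 0.0
    backpointers = []
    for h_i in h:
        bptrs_t, new_alpha = [], []
        for cur in range(num_tags):
            # best previous tag for `cur`; the emission is added afterwards
            # because it does not depend on the previous tag
            cand = [alpha[prev] + P[cur][prev] for prev in range(num_tags)]
            best_prev = max(range(num_tags), key=cand.__getitem__)
            bptrs_t.append(best_prev)
            new_alpha.append(cand[best_prev] + h_i[cur])
        alpha = new_alpha
        backpointers.append(bptrs_t)
    # time step n + 1: transition into <STOP>
    terminal = [alpha[prev] + P[STOP][prev] for prev in range(num_tags)]
    best = max(range(num_tags), key=terminal.__getitem__)
    path_score = terminal[best]
    # trace the backpointers from the last position to the first
    path = [best]
    for bptrs_t in reversed(backpointers):
        best = bptrs_t[best]
        path.append(best)
    assert path.pop() == START  # the trace must end at <START>
    path.reverse()
    return path_score, path


path_score, best_path = viterbi_decode(h, P)  # best_path is [0, 1, 2], i.e. B, I, O
```

In this toy input the largest emission at the first position belongs to I, but `P[I][<START>] = -2.0` makes B the better first tag, so the decoded path differs from a per-position argmax over emissions.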

References


  1. Making dynamic decisions and the BiLSTM-CRF, official PyTorch tutorial

  2. 《统计学习方法》 (Statistical Learning Methods), Chapter 11 "Conditional Random Fields", Sections 11.2, 11.3 and 11.5
