My knowledge of CRFs used to be little more than hearsay, which is hardly excusable for an NLP PhD student. Having recently needed a CRF in a project, I followed [1] and worked through both the theory and the code of the linear-chain CRF.

For the theory of the linear-chain CRF, I recommend reading [2] first. If the book feels notation-heavy and hard to digest, you can read it alongside this post. On the implementation side, PyTorch provides an official tutorial [1], but it maps the code to the theory only briefly, so readers coming straight from the book may still find the code difficult. The goal of this post is to walk readers who already have a basic grasp of linear-chain CRF theory through every detail of the implementation, building a linear-chain CRF from start to finish.

Part I of this post follows the code in [1], explaining it piece by piece against the theory and reconstructing the order in which the implementation comes together. Since the CRF in [1] handles only a single sample rather than a batch, Part II implements a CRF layer that processes a whole batch, explains how to handle masks when the sequences in a batch have different lengths, and provides the corresponding code.

The symbols in this post and the variable names in the code are kept consistent with [2] and [1] wherever possible.
| Notation | Meaning |
|---|---|
| $x_i$ | the input sequence $x_1, \dots, x_n$ |
| $y_i$ | the tag sequence $y_1, \dots, y_n$, $y_i \in \lbrace 0, 1, \dots, \lvert T \rvert - 1 \rbrace$ |
| $h_i$ | the hidden representation of each $x_i$, $h_i \in \mathbb{R}^{\lvert T \rvert}$ |
| $T$ | the tag set, including $start$ and $stop$ |
| $start, stop$ | two additional special tags, $start, stop \in \lbrace 0, 1, \dots, \lvert T \rvert - 1 \rbrace$ |
| $P$ | the (only) trainable parameter within the CRF layer, $P \in \mathbb{R}^{\lvert T \rvert \times \lvert T \rvert}$: `self.transitions = nn.Parameter(torch.randn(self.tagset_size, self.tagset_size))` |
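Right after creating `self.transitions`, the tutorial [1] also constrains the matrix so that no tag can transition into $start$ and nothing can follow $stop$. A minimal sketch of that initialization (the index constants are mine, matching the `tag_to_ix` used later):

```python
import torch
import torch.nn as nn

tagset_size = 5        # B, I, O, plus the start and stop tags
START_IX, STOP_IX = 3, 4

# transitions[to, frm] = P[to][frm]: score of transitioning frm -> to
transitions = nn.Parameter(torch.randn(tagset_size, tagset_size))

# forbid transitions into start (row) and out of stop (column)
transitions.data[START_IX, :] = -10000.
transitions.data[:, STOP_IX] = -10000.
```

Because the forward and Viterbi passes only ever add these scores, a value of `-10000.` effectively zeroes out any path that uses a forbidden transition.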
Linear-chain CRF: compute the conditional probability $P(y \mid x)$ of a tag sequence $y$ given an input token sequence $x$.

Estimate $P(y \mid x)$: for each $y$, take the summed value of the feature functions as the estimated $Score(x, y)$, where $Score(x, y) \propto \log P(y \mid x) = \log P(y_1, \dots, y_n \mid x)$ and $P(y \mid x) = \frac{\exp(Score(x, y))}{\sum_{y'} \exp(Score(x, y'))}$.
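To make the normalization concrete, here is a toy example with three hypothetical candidate tag sequences and made-up scores (not from the tutorial):

```python
import math

# hypothetical scores Score(x, y) for three candidate tag sequences of one input
scores = {"B I O": 4.2, "B I I": 3.1, "O O O": 0.5}

# P(y|x) = exp(Score(x, y)) / sum over all candidates of exp(Score(x, y'))
denom = sum(math.exp(s) for s in scores.values())
probs = {y: math.exp(s) / denom for y, s in scores.items()}

print(max(probs, key=probs.get))  # "B I O": the highest score wins after softmax
```

The exponential keeps every probability positive, and the shared denominator makes them sum to one, so ranking by probability is the same as ranking by score.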
`_score_sentence`
Train: optimize the model by minimizing $-\log P(y \mid x)$
Calculate the score of one specific sample, i.e. estimate $\log P(y_0 = start, y_1, \dots, y_n, y_{n+1} = stop \mid x)$ given $y, x$.

Add $y_0 = start$ and $y_{n+1} = stop$ to the tag sequence, where $start, stop \in \lbrace 0, 1, \dots, \lvert T \rvert - 1 \rbrace$:
```python
START_TAG = "<START>"
STOP_TAG = "<STOP>"
tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

self.tag_to_ix = tag_to_ix
self.tagset_size = len(tag_to_ix)

tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
```

After this update: `tags.shape == (seq_len + 1,)`
Score = the sum of the values of the feature functions:

```python
score = torch.zeros(1)
for i, feat in enumerate(feats):
    score = score + \
        self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
```
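The same gold-score computation, rewritten as a self-contained plain-Python sketch (list-based stand-ins for the tensors; the function name and argument names are mine, not the tutorial's):

```python
def score_sentence(feats, tags, transitions, start_ix, stop_ix):
    """Gold score of one tagged sequence: summed transition and emission scores.

    feats:       per-token emission scores, feats[i][t] = h_i[t]
    tags:        gold tag indices y_1 .. y_n (without start/stop)
    transitions: transitions[to][frm] = P[to][frm]
    """
    padded = [start_ix] + list(tags)      # prepend y_0 = start
    score = 0.0
    for i, feat in enumerate(feats):
        # transition y_i -> y_{i+1}, plus the emission score for y_{i+1}
        score += transitions[padded[i + 1]][padded[i]] + feat[padded[i + 1]]
    # final transition from y_n into stop (no emission at the stop step)
    score += transitions[stop_ix][padded[-1]]
    return score

# with all-zero transitions the score reduces to the summed gold emissions
zero = [[0.0] * 4 for _ in range(4)]
print(score_sentence([[1.0, 2.0, 0.0, 0.0], [3.0, 4.0, 0.0, 0.0]],
                     [0, 1], zero, start_ix=2, stop_ix=3))  # 5.0
```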
Calculate the total score over every possible $y$ given $x$, i.e. estimate $\log P(y_0 = start, y_{n+1} = stop \mid x)$, summing over all intermediate tag sequences.

The per-step matrix $M_i \in \mathbb{R}^{\lvert T \rvert \times \lvert T \rvert}$, with $M_i[y_i][y_{i-1}] = P[y_i][y_{i-1}] + h_i[y_i]$.
An extrapolation
Consider the estimation of $P(y_0 \mid y_0 = start, x)$:

$\alpha_0[y_0] = \begin{cases} 0, & y_0 = start \\ -\infty, & \text{otherwise} \end{cases}$

$\alpha_0 \propto \log P(y_0 \mid y_0 = start, x)$, $\alpha_0 \in \mathbb{R}^{\lvert T \rvert}$: `init_alphas`, the initial `forward_var`

```python
init_alphas = torch.full((1, self.tagset_size), -10000.)
init_alphas[0][self.tag_to_ix[START_TAG]] = 0.
forward_var = init_alphas
```
Consider the estimation of $P(y_1 \mid y_0 = start, x)$, then generalize to each time step $1 \leq i \leq n$.

Please refer to the for loop in the code: the part labeled "Iterate through the sentence".
$\forall y_i, y_{i-1} \in T$: $Score(y_i, y_{i-1}) = \alpha_{i-1}[y_{i-1}] + M_i[y_i][y_{i-1}] \propto \log P(y_i, y_{i-1} \mid y_0 = start, x)$

```python
for next_tag in range(self.tagset_size):
    emit_score = feat[next_tag].view(
        1, -1).expand(1, self.tagset_size)
    trans_score = self.transitions[next_tag].view(1, -1)
    next_tag_var = forward_var + trans_score + emit_score
```
log-sum-exp over $y_{i-1}$: $\alpha_i[y_i] = Score(y_i) = \log \sum_{y_{i-1}} \exp(Score(y_i, y_{i-1}))$
```python
alphas_t = []
for next_tag in range(self.tagset_size):
    alphas_t.append(log_sum_exp(next_tag_var).view(1))
forward_var = torch.cat(alphas_t).view(1, -1)
```
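The `log_sum_exp` helper used above is not shown in this excerpt; a plain-Python equivalent, numerically stable because it subtracts the maximum before exponentiating (the tutorial's tensor version does the same):

```python
import math

def log_sum_exp(vec):
    """Stable log(sum(exp(v))): factor out the max so exp never overflows."""
    m = max(vec)
    return m + math.log(sum(math.exp(v - m) for v in vec))

# works even where a naive exp would overflow
print(round(log_sum_exp([1000.0, 1000.0]) - 1000.0, 6))  # log(2) ~ 0.693147
```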
For time step $i = n+1$:

We only calculate $Score(stop, y_n) = Score(y_n) + M_{n+1}[stop][y_n]$, where $M_{n+1}[stop][y_n] = P[stop][y_n]$ since there is no emission at the $stop$ step.

$Score(stop, y_n) \propto \log P(y_{n+1} = stop, y_n \mid y_0 = start, x)$

$\log P(y_{n+1} = stop \mid y_0 = start, x) = \log \sum_{y_n} \exp(Score(stop, y_n))$

```python
terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
alpha = log_sum_exp(terminal_var)
```

$\alpha = \log P(y_{n+1} = stop \mid y_0 = start, x)$

Returns: $\alpha = \log P(y_0 = start, y_{n+1} = stop \mid x)$
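As a sanity check on the whole derivation, here is a plain-Python forward pass compared against brute-force enumeration of every tag path; the two must agree, because the forward recursion is just a factored log-sum-exp over all paths. The toy transition and emission numbers are mine, not the tutorial's:

```python
import math
from itertools import product

def log_sum_exp(vec):
    m = max(vec)
    return m + math.log(sum(math.exp(v - m) for v in vec))

def forward_alg(feats, trans, start_ix, stop_ix, n_tags):
    # alpha_0: all probability mass on the start tag
    alpha = [0.0 if t == start_ix else -10000.0 for t in range(n_tags)]
    for feat in feats:
        # alpha_i[nxt] = logsumexp over prev of (alpha_{i-1}[prev] + P[nxt][prev]) + h_i[nxt]
        alpha = [log_sum_exp([alpha[prev] + trans[nxt][prev] for prev in range(n_tags)])
                 + feat[nxt]
                 for nxt in range(n_tags)]
    # terminal step: transition into stop, no emission
    return log_sum_exp([alpha[prev] + trans[stop_ix][prev] for prev in range(n_tags)])

# toy setup: real tags {0, 1}, start = 2, stop = 3; trans[to][frm]
NEG = -10000.0
trans = [[0.1, 0.2, 0.3, NEG],
         [0.4, 0.5, 0.6, NEG],
         [NEG, NEG, NEG, NEG],   # nothing may transition into start
         [0.7, 0.8, 0.9, NEG]]   # last column: nothing may leave stop
feats = [[1.0, 2.0, 0.0, 0.0],
         [0.5, 1.6, 0.0, 0.0]]

# brute force: log-sum-exp over every real-tag path, scored as in _score_sentence
paths = []
for y in product([0, 1], repeat=2):
    s = trans[y[0]][2] + feats[0][y[0]] \
        + trans[y[1]][y[0]] + feats[1][y[1]] + trans[3][y[1]]
    paths.append(s)

assert abs(forward_alg(feats, trans, 2, 3, 4) - log_sum_exp(paths)) < 1e-6
```

The masked transitions (score `-10000.`) make paths through $start$ or $stop$ in the middle of the sequence contribute nothing, which is why the forward pass over all four tags matches the enumeration over the two real tags.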
The training objective for the CRF model: minimize the negative log likelihood of $P(y \mid x)$.

- Forward score: $\alpha$
- Gold score: $Score(x, y)$
- The loss: since all sequences begin with $start$ and end with $stop$, $loss = -\log P(y_1, \dots, y_n \mid x) = \alpha - Score(x, y)$

```python
forward_score = self._forward_alg(feats)
gold_score = self._score_sentence(feats, tags)
return forward_score - gold_score
```
Make predictions with the CRF model: $y^* = \underset{y}{\arg\max}\ Score(x, y)$
Notations:

The initial score $\alpha_0$: $\alpha_0[y_j] = Score(x, y, 0)\,\big|_{y_0 = y_j}$

$\alpha_0[y_0] = \begin{cases} 0, & y_0 = start \\ -\infty, & \text{otherwise} \end{cases}$

$\alpha_0 \in \mathbb{R}^{\lvert T \rvert}$: `init_vvars`, the initial `forward_var`
Go forward: for each position $1 \leq i \leq n$
Please refer to the for loop in code
Algorithm:

Notice that $Score(x, y, i) = Score(x, y, i-1) + h_i[y_i] + P[y_i][y_{i-1}]$, so

$\alpha_i'[y_i] = \underset{y_{i-1}}{\max}\ \alpha_{i-1}[y_{i-1}] + P[y_i][y_{i-1}]$

```python
next_tag_var = forward_var + self.transitions[next_tag]
best_tag_id = argmax(next_tag_var)
```

$\alpha_i[y_i] = \alpha_i'[y_i] + h_i[y_i]$

```python
for next_tag in range(self.tagset_size):
    viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
```
Comments: `bptrs_t`$[y_i] = \underset{y_{i-1}}{\arg\max}\ \alpha_{i-1}[y_{i-1}] + P[y_i][y_{i-1}]$, and is recorded in `backpointers`.
For time step $n+1$:

$Score(x, y^*) = \underset{y_n}{\max}\ \alpha_n[y_n] + P[stop][y_n]$

```python
terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
best_tag_id = argmax(terminal_var)
path_score = terminal_var[0][best_tag_id]
```

Returns: the path score; the best path, recovered by walking back through `backpointers`.
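The backtracking over `backpointers` is the one piece not excerpted above; here is a self-contained plain-Python sketch of the full Viterbi pass, with toy transition and emission numbers of my own (same `trans[to][frm]` layout as before):

```python
def viterbi_decode(feats, trans, start_ix, stop_ix, n_tags):
    # alpha_0: only the start tag is viable
    alpha = [0.0 if t == start_ix else -10000.0 for t in range(n_tags)]
    backpointers = []
    for feat in feats:
        bptrs_t, next_alpha = [], []
        for nxt in range(n_tags):
            # best previous tag for each next tag; the emission is added
            # afterwards, since it does not depend on the previous tag
            cand = [alpha[prev] + trans[nxt][prev] for prev in range(n_tags)]
            best = max(range(n_tags), key=cand.__getitem__)
            bptrs_t.append(best)
            next_alpha.append(cand[best] + feat[nxt])
        backpointers.append(bptrs_t)
        alpha = next_alpha
    # terminal transition into stop (no emission)
    cand = [alpha[prev] + trans[stop_ix][prev] for prev in range(n_tags)]
    best = max(range(n_tags), key=cand.__getitem__)
    path_score, path = cand[best], [best]
    # walk the backpointers from the last position to the first
    for bptrs_t in reversed(backpointers):
        path.append(bptrs_t[path[-1]])
    assert path.pop() == start_ix  # the chain must begin at start
    path.reverse()
    return path_score, path

NEG = -10000.0
trans = [[0.1, 0.2, 0.3, NEG],
         [0.4, 0.5, 0.6, NEG],
         [NEG, NEG, NEG, NEG],
         [0.7, 0.8, 0.9, NEG]]
feats = [[1.0, 2.0, 0.0, 0.0],
         [0.5, 1.6, 0.0, 0.0]]
score, path = viterbi_decode(feats, trans, start_ix=2, stop_ix=3, n_tags=4)
print(path, round(score, 1))  # [1, 1] 5.5
```

Unlike the forward pass, every `log_sum_exp` is replaced by a `max` plus a recorded `argmax`, so the returned `path_score` equals the gold score of the returned `path`.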
[1] Making dynamic decisions and the BiLSTM-CRF (the official PyTorch tutorial)
[2] 《统计学习方法》 (Statistical Learning Methods), Chapter 11 on conditional random fields: Sections 11.2, 11.3, and 11.5