I won't rehash the basic theoretical derivations here; there are plenty of excellent write-ups online. But I've noticed they tend to fall into two camps: one explains the theory very well but never connects it to actual code; the other focuses on practice, mentioning the formulas only in passing and assuming the reader already knows them by heart. So I want to build a bridge and walk through CRF by tying the formulas to a code implementation. If you are not yet comfortable with the theory, please see the reference links at the end of this article, which cover it in detail. The code used in this article comes from the official PyTorch tutorial: https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html
Generative vs. Discriminative Models
Put simply, a generative model builds a model of the joint probability distribution $P(X,Y)$ from the dataset, e.g. Naive Bayes or HMM; a discriminative model uses the correspondence between $X$ and $Y$ in the dataset to build a decision boundary (a hyperplane) for classification, so its purpose is classification itself, e.g. SVM or neural networks. Remember this: HMM is a generative model, while CRF is a discriminative model! For a more detailed explanation see section 2.2 of https://zhuanlan.zhihu.com/p/33397147.
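To make the contrast concrete, the two prediction rules look like this (standard textbook forms, my own illustration rather than something from the linked article):

$$\text{generative:}\quad \hat{y}=\arg\max_y P(X, y)=\arg\max_y P(X\mid y)\,P(y)$$

$$\text{discriminative:}\quad \hat{y}=\arg\max_y P(y\mid X)$$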
Probability Computation on an Undirected Graph
Given an undirected graph (generally a Markov network), factorization lets us write $P(Y)$ as a product of factors, one per maximal clique of the graph. For example, the graph below can be decomposed into two maximal cliques, $(x_1, x_3, x_4)$ and $(x_2, x_3, x_4)$, so:
$$P(Y)=\frac{1}{Z(x)}\prod_{c=1}^{2}\psi_c(Y_c)=\frac{1}{Z(x)}\,\psi_1(x_1, x_3, x_4)\cdot\psi_2(x_2, x_3, x_4)$$
where $Z(x)= \sum_{Y}\prod_{c}\psi_c (Y_c)$ is the normalization factor; you can think of it as the analogue of the denominator in the softmax formula.
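Side by side (my own illustration): the softmax denominator sums the unnormalized scores $e^{z_j}$ over all classes, just as $Z(x)$ sums the unnormalized clique products over all possible label sequences $Y$:

$$\mathrm{softmax}(z)_i=\frac{e^{z_i}}{\sum_{j} e^{z_j}} \qquad\longleftrightarrow\qquad P(Y)=\frac{\prod_c \psi_c(Y_c)}{\sum_{Y'}\prod_c \psi_c(Y'_c)}$$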
Definition of $\psi_c(Y_c)$
$\psi_c(Y_c)$ is the potential over the random variables of a maximal clique $C$, and it is usually taken to be an exponential function:
$$\psi_c (Y_c)=e^{-E(Y_c)}=e^{\sum_{k}\lambda_k f_k(c,y,x)}$$

Don't worry for now about what $\lambda_k f_k(c,y,x)$ actually is; just plug this whole thing into the formula for $P(Y)$ above:
$$P(Y)=\frac{1}{Z(x)}\prod_{c}e^{\sum_{k}\lambda_k f_k(c,y,x)}=\frac{1}{Z(x)}\,e^{\sum_c\sum_{k}\lambda_k f_k(c,y,x)}$$
We have now arrived at the CRF formula. How to actually compute it comes later; first let's see where this expression comes from. First, why does it look like this? Because the whole point of this construction is to predict the sequence $Y$ given the sequence $X$, i.e. we are after $P(Y|X)$: a discriminative model! Second, why did the product sign disappear in the final formula? Because $\psi_c(Y_c)$ was defined as an exponential, so the product turns into a sum in the exponent!
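For the linear-chain CRF used in sequence tagging (the case the BiLSTM-CRF code below handles), the maximal cliques are pairs of adjacent labels, and it is convenient to split the exponent into a transition part and an emission part. The $g$/$h$ notation here is my own bridge to the symbols used further down; treat it as an assumed decomposition rather than something derived in this article:

$$P(Y|X)=\frac{1}{Z(x)}\,e^{\sum_{t}\big(g(y_{t-1},\,y_t)+h_t(y_t|X)\big)}$$

Here $g(y_{t-1}, y_t)$ is the transition score between adjacent tags and $h_t(y_t|X)$ is the emission score of tag $y_t$ at position $t$; in the BiLSTM-CRF code these correspond to self.transitions and the LSTM feature output feats, respectively.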
How to Maximize $P(Y)$
Let's keep ignoring how to compute that term and first get the overall workflow straight; the details come later.
As we learned in probability class, we can use maximum likelihood estimation to fit the parameters of the distribution, i.e. our goal is to maximize $\log P(Y)$:
$$\log P(Y) = \log\frac{1}{Z(x)}e^{\sum_c\sum_{k}\lambda_k f_k(c,y,x)}=\sum_c\sum_{k}\lambda_k f_k(c,y,x) - \log Z(x)$$
Maximizing $\log P(Y)$ is equivalent to minimizing $-\log P(Y)$, so the objective becomes minimizing $-\log P(Y)$:
$$-\log P(Y) = \log Z(x) - \sum_c\sum_{k}\lambda_k f_k(c,y,x)$$
In the code, this corresponds to the last line of the neg_log_likelihood() function:
def neg_log_likelihood(self, sentence, tags):
    feats = self._get_lstm_features(sentence)
    forward_score = self._forward_alg(feats)
    gold_score = self._score_sentence(feats, tags)
    return forward_score - gold_score
forward_score is our $\log Z(x)$, and gold_score is the second term, $\sum_c\sum_{k}\lambda_k f_k(c,y,x)$.
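For completeness, gold_score comes from _score_sentence(), which simply walks along the given (gold) tag sequence and adds up the corresponding transition and emission scores. The sketch below follows the tutorial's conventions (self.transitions[i][j] is the score of transitioning to tag i from tag j); check the tutorial itself for the exact version:

def _score_sentence(self, feats, tags):
    # Score of the provided (gold) tag sequence:
    # emission score of each gold tag plus the transition score
    # between consecutive gold tags.
    score = torch.zeros(1)
    # Prepend START_TAG so (tags[i], tags[i + 1]) are the transition pairs
    tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
    for i, feat in enumerate(feats):
        # transitions[i][j] is the score of transitioning to tag i from tag j
        score = score + self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
    # Finally, the transition from the last gold tag to STOP_TAG
    score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
    return score

The harder part is forward_score, i.e. $\log Z(x)$: it sums over exponentially many tag sequences, so we need the forward algorithm to compute it efficiently. That is what the rest of this section is about.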
Mapping the figure onto the formulas (here the number of states, i.e. tags, is $k = 4$): at every step $t$ we need to know the emission score $h_{t+1}(y_k|X)$ of each state.
$Z_i^{t}$ denotes the exponential sum of the scores of all paths that end in state (tag) $y_i$ at the current time step $t$, i.e. the forward variable of the forward algorithm. Let's look at how $Z_i^{t+1}$ is obtained: $Z_1^{t+1}$ is (all the red edges at time step $t$ in the figure) $\times$ (the value of state 1 at time step $t+1$):
$$Z_1^{t+1}=Z_1^{t}\cdot G_{11}\cdot H_{t+1}(1|X) + Z_2^{t}\cdot G_{21}\cdot H_{t+1}(1|X) + Z_3^{t}\cdot G_{31}\cdot H_{t+1}(1|X) + Z_4^{t}\cdot G_{41}\cdot H_{t+1}(1|X)$$

$$Z_2^{t+1}=Z_1^{t}\cdot G_{12}\cdot H_{t+1}(2|X) + Z_2^{t}\cdot G_{22}\cdot H_{t+1}(2|X) + Z_3^{t}\cdot G_{32}\cdot H_{t+1}(2|X) + Z_4^{t}\cdot G_{42}\cdot H_{t+1}(2|X)$$

$$\cdots$$

$$Z_i^{t+1}=Z_1^{t}\cdot G_{1i}\cdot H_{t+1}(i|X) + Z_2^{t}\cdot G_{2i}\cdot H_{t+1}(i|X) + Z_3^{t}\cdot G_{3i}\cdot H_{t+1}(i|X) + Z_4^{t}\cdot G_{4i}\cdot H_{t+1}(i|X)$$
where $G$ is the matrix obtained by taking the exponential of each element of $g(y_i,y_j)$:
$$G_{ij}=e^{g(y_i,y_j)}$$
Similarly, $H_{t+1}(y_{k}|X)=e^{h_{t+1}(y_{k}|X)}$.
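In matrix form the recursion is simply $Z^{t+1} = (Z^{t} G)\odot H_{t+1}$ (elementwise product with the emission column), and to avoid overflow the code works in log space, where products become sums and the sum over previous tags becomes a log-sum-exp. Below is a minimal standalone sketch of this recursion with my own (hypothetical) names emit and trans, not the tutorial's variables:

import torch

def log_partition(emit, trans):
    # emit:  (seq_len, num_tags)  log emission scores, emit[t, i] = h_{t+1}(i|X)
    # trans: (num_tags, num_tags) log transition scores, trans[j, i] = g(y_j, y_i)
    log_Z = emit[0]                              # paths of length 1
    for t in range(1, emit.size(0)):
        # scores[j, i] = log_Z[j] + trans[j, i] + emit[t, i]
        scores = log_Z.unsqueeze(1) + trans + emit[t].unsqueeze(0)
        log_Z = torch.logsumexp(scores, dim=0)   # marginalize out the previous tag j
    return torch.logsumexp(log_Z, dim=0)         # finally sum over the last tag

This returns exactly $\log Z(x)$; the tutorial's _forward_alg below performs the same computation, with an explicit Python loop over next_tag and extra START_TAG/STOP_TAG transitions.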
The corresponding code:
def _forward_alg(self, feats):
    # Do the forward algorithm to compute the partition function
    init_alphas = torch.full((1, self.tagset_size), -10000.)
    # START_TAG has all of the score.
    init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

    # Wrap in a variable so that we will get automatic backprop
    forward_var = init_alphas

    # Iterate through the sentence
    for feat in feats:
        alphas_t = []  # The forward tensors at this timestep
        for next_tag in range(self.tagset_size):
            # Score of the state (emission) feature function
            emit_score = feat[next_tag].view(1, -1).expand(1, self.tagset_size)
            # Score of the transition feature function
            trans_score = self.transitions[next_tag].view(1, -1)
            # Scores of moving from every state of the previous word to next_tag,
            # so next_tag_var is an array of size tagset_size
            next_tag_var = forward_var + trans_score + emit_score
            # The forward variable for this tag is log-sum-exp of all the
            # scores.
            alphas_t.append(log_sum_exp(next_tag_var).view(1))
        forward_var = torch.cat(alphas_t).view(1, -1)
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    alpha = log_sum_exp(terminal_var)
    return alpha
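A quick sanity check (my own snippet, not part of the tutorial): the forward algorithm must agree with brute-force enumeration over all tag paths, since $\log Z(x)$ is just a log-sum-exp over every possible path score. Using the hypothetical log_partition() sketch from above:

import itertools
import torch

def brute_force_log_partition(emit, trans):
    # Enumerate every tag path, score it, then log-sum-exp over all paths.
    seq_len, num_tags = emit.shape
    path_scores = []
    for path in itertools.product(range(num_tags), repeat=seq_len):
        s = emit[0, path[0]]
        for t in range(1, seq_len):
            s = s + trans[path[t - 1], path[t]] + emit[t, path[t]]
        path_scores.append(s)
    return torch.logsumexp(torch.stack(path_scores), dim=0)

emit, trans = torch.randn(3, 4), torch.randn(4, 4)   # 3 words, 4 tags
print(torch.allclose(brute_force_log_partition(emit, trans), log_partition(emit, trans)))  # True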
A quick note on the log_sum_exp helper used above:
def log_sum_exp(vec):
    # Compute log(sum(exp(vec))) in a numerically stable way
    # by subtracting the max before exponentiating.
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))
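It computes the same quantity as PyTorch's built-in torch.logsumexp; a quick check (my own snippet):

import torch

vec = torch.randn(1, 5)
manual = vec.max() + torch.log(torch.sum(torch.exp(vec - vec.max())))
print(torch.allclose(manual, torch.logsumexp(vec, dim=1)))  # True

With training (minimizing forward_score - gold_score) covered, the remaining piece is inference: finding the single best tag sequence, which _viterbi_decode() does with the Viterbi algorithm. It has the same structure as _forward_alg(), except that the log-sum-exp over previous tags is replaced by a max, and backpointers are kept so the best path can be recovered: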
def _viterbi_decode(self, feats):
    backpointers = []

    # Initialize the viterbi variables in log space
    init_vvars = torch.full((1, self.tagset_size), -10000.)
    init_vvars[0][self.tag_to_ix[START_TAG]] = 0

    # forward_var at step i holds the viterbi variables for step i-1
    forward_var = init_vvars
    for feat in feats:
        bptrs_t = []  # holds the backpointers for this step
        viterbivars_t = []  # holds the viterbi variables for this step

        for next_tag in range(self.tagset_size):
            # next_tag_var[i] holds the viterbi variable for tag i at the
            # previous step, plus the score of transitioning
            # from tag i to next_tag.
            # We don't include the emission scores here because the max
            # does not depend on them (we add them in below)
            next_tag_var = forward_var + self.transitions[next_tag]
            best_tag_id = argmax(next_tag_var)
            bptrs_t.append(best_tag_id)
            viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
        # Now add in the emission scores, and assign forward_var to the set
        # of viterbi variables we just computed
        forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
        backpointers.append(bptrs_t)

    # Transition to STOP_TAG
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    best_tag_id = argmax(terminal_var)
    path_score = terminal_var[0][best_tag_id]

    # Follow the back pointers to decode the best path.
    best_path = [best_tag_id]
    for bptrs_t in reversed(backpointers):
        best_tag_id = bptrs_t[best_tag_id]
        best_path.append(best_tag_id)
    # Pop off the start tag (we don't want to return that to the caller)
    start = best_path.pop()
    assert start == self.tag_to_ix[START_TAG]  # Sanity check
    best_path.reverse()
    return path_score, best_path
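Finally, a rough usage sketch. The names below (BiLSTM_CRF, prepare_sequence, word_to_ix, training_data, EMBEDDING_DIM, HIDDEN_DIM) follow the tutorial's setup and are assumed to be defined as there; treat this as an outline rather than the tutorial's exact training loop:

import torch
import torch.optim as optim

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

for sentence, tags in training_data:
    model.zero_grad()
    sentence_in = prepare_sequence(sentence, word_to_ix)
    targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)
    # loss = forward_score - gold_score = log Z(x) - score of the gold path
    loss = model.neg_log_likelihood(sentence_in, targets)
    loss.backward()
    optimizer.step()

# At inference time, forward() calls _viterbi_decode() to return the best path
with torch.no_grad():
    score, tag_seq = model(prepare_sequence(sentence, word_to_ix))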
[1] https://www.jiqizhixin.com/articles/2018-05-23-3
[2] https://zhuanlan.zhihu.com/p/71190655
[3] https://zhuanlan.zhihu.com/p/33397147
[4] https://pytorch.org/tutorials/beginner/nlp/advanced_tutorial.html
[5] https://blog.csdn.net/zycxnanwang/article/details/90385259