A Conditional Random Field (CRF) is a combination of probabilistic graphical models and discriminative classification, and can be used to model "structured prediction" problems (e.g. sequence labeling).
As shown in Figure 1, paper [1] illustrates the relationship between the CRF and other models.
Figure 1. The CRF compared with other machine-learning models【src】
In this post we focus on the principles and implementation of the "linear-chain CRF" with independent input nodes (Figure 2). By coupling the linear-chain CRF with a bidirectional LSTM (Bi-LSTM), we can model the more general linear-chain CRF (Figure 3) and increase the model's expressive power.
Given a sequence of length $m$ and a state set $S$, for any state sequence $(s_1,\cdots,s_m),\ s_i \in S$, define its "potential" as follows:

$$\psi(s_1,\ldots,s_m)=\prod_{i=1}^{m}\psi(s_{i-1},s_i,i)$$

where $s_0=*$ denotes a special start symbol.
According to the factorization theory of probabilistic graphical models [1], we have:

$$p(s_1,\ldots,s_m)=\frac{\psi(s_1,\ldots,s_m)}{Z}$$

where $Z=\sum_{s'_1,\ldots,s'_m}\psi(s'_1,\ldots,s'_m)$ is the normalization factor.
Like the HMM, the CRF involves three basic problems: evaluation (computing the likelihood of a given sequence), decoding (finding the most likely sequence given the input), and training (estimating the CRF parameters from data). Solving these three problems again involves the forward algorithm, the backward algorithm, and the Viterbi algorithm.
The CRF potential function behaves like a probability, only without normalization, so the forward, Viterbi, and backward algorithms introduced here are essentially the same as those for the HMM.
Define:

$$\alpha(i,s)=\sum_{s_1,\ldots,s_{i-1}}\psi(*,s_1,1)\,\psi(s_1,s_2,2)\cdots\psi(s_{i-1},s,i)$$

which denotes the total potential of the length-$i$ prefixes ending in state $s$.
Clearly, $\alpha(1,s)=\psi(*,s,1)$.
From the definition, we have the following recurrence:

$$\alpha(i,s)=\sum_{s'\in S}\alpha(i-1,s')\,\psi(s',s,i)$$

The normalization factor can then be computed as:

$$Z=\sum_{s\in S}\alpha(m,s)$$

For a given sequence $(s_1,\cdots,s_m)$, its conditional probability (likelihood) can be computed as:

$$p(s_1,\ldots,s_m)=\frac{\prod_{i=1}^{m}\psi(s_{i-1},s_i,i)}{Z}$$
*Via the forward algorithm we have solved the evaluation problem, with time and space complexity $O(m\cdot|S|^2)$.*
The likelihood computation involves only multiplication and addition, which are both differentiable operations. Therefore, once the forward pass is implemented, we can rely on a learning library with automatic differentiation (e.g. pytorch, tensorflow) to train under the maximum-likelihood criterion. A pytorch-based CRF implementation can be found in the repo.
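As a taste of that autograd route, here is a minimal sketch (our own code, not the repo's implementation) of the log-domain forward pass written with torch tensors; the layout `log_psi[t, i, j]` := $\log\psi(s_{t-1}{=}i,\,s_t{=}j,\,t)$ mirrors the numpy code below, and all names are made up for illustration.

import torch

def crf_nll(log_psi, seq):
    """Negative log-likelihood of `seq` under potentials exp(log_psi).

    log_psi: (m, V, V) tensor; row 0 of log_psi[0] holds log psi(*, s, 1).
    seq:     length-m tensor of gold states.
    """
    m, V, _ = log_psi.shape
    log_alpha = log_psi[0, 0, :]            # log alpha(1, s)
    score = log_psi[0, 0, seq[0]]           # log of the numerator
    for t in range(1, m):
        # log alpha(t, j) = logsumexp_i [ log alpha(t-1, i) + log psi(t, i, j) ]
        log_alpha = torch.logsumexp(log_alpha.unsqueeze(1) + log_psi[t], dim=0)
        score = score + log_psi[t, seq[t - 1], seq[t]]
    log_Z = torch.logsumexp(log_alpha, dim=0)
    return log_Z - score                    # -log p(seq)

# usage: gradients w.r.t. log_psi (or any network that produced it) come for free
log_psi = torch.randn(10, 5, 5, requires_grad=True)
seq = torch.randint(0, 5, (10,))
crf_nll(log_psi, seq).backward()

Minimizing this loss with any torch optimizer is exactly maximum-likelihood training; the hand-derived gradients later in this post are only needed if one wants to avoid autograd.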
import numpy as np
def forward(psi):
m, V, _ = psi.shape
alpha = np.zeros([m, V])
alpha[0] = psi[0, 0, :] # assume psi[0, 0, :] := psi(*,s,1)
for t in range(1, m):
for i in range(V):
'''
for k in range(V):
alpha[t, i] += alpha[t - 1, k] * psi[t, k, i]
'''
alpha[t, i] = np.sum(alpha[t - 1, :] * psi[t, :, i])
return alpha
def pro(seq, psi):
m, V, _ = psi.shape
alpha = forward(psi)
Z = np.sum(alpha[-1])
M = psi[0, 0, seq[0]]
for i in range(1, m):
M *= psi[i, seq[i-1], seq[i]]
p = M / Z
return p
np.random.seed(1111)
V, m = 5, 10
log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi) # nonnegative
seq = np.random.choice(V, m)
alpha = forward(psi)
p = pro(seq, psi)
print(p)
print(alpha)
2.69869828108e-08
[[ 1.10026295e+00 2.52187760e+00 1.40997704e+00 1.36407554e+00
1.00201186e+00]
[ 1.27679086e+01 1.03890052e+01 1.44699134e+01 1.15244329e+01
1.52767179e+01]
[ 9.30306192e+01 1.09450375e+02 1.26777728e+02 1.28529576e+02
1.16835669e+02]
[ 9.81861108e+02 8.70384204e+02 9.35531558e+02 7.98228277e+02
9.89225754e+02]
[ 6.89790063e+03 8.71016058e+03 8.84778486e+03 9.21051594e+03
6.56093883e+03]
[ 7.56109978e+04 7.00773298e+04 8.60611103e+04 5.63567069e+04
5.99238226e+04]
[ 6.69236243e+05 6.42107210e+05 7.81638452e+05 6.32533145e+05
5.71122492e+05]
[ 6.62242340e+06 5.24446290e+06 5.54750409e+06 4.68782248e+06
4.49353155e+06]
[ 4.31080734e+07 4.09579660e+07 4.62891972e+07 4.60100937e+07
4.63083098e+07]
[ 2.66620185e+08 4.91942550e+08 4.48597546e+08 3.42214705e+08
4.10510463e+08]]
Viterbi uses dynamic programming to find the sequence with the largest likelihood. It is very similar to the forward algorithm, except that the summation is replaced by a maximization.
From the definition, we have the following recurrence (same initialization $\alpha(1,s)=\psi(*,s,1)$, with the sum replaced by a max):

$$\alpha(i,s)=\max_{s'\in S}\alpha(i-1,s')\,\psi(s',s,i)$$

Among all $|S|^m$ possible sequences, the unnormalized value of the most probable path is:

$$\max_{s_1,\ldots,s_m}\psi(s_1,\ldots,s_m)=\max_{s\in S}\alpha(m,s)$$
def viterbi_1(psi):
m, V, _ = psi.shape
alpha = np.zeros([V])
trans = np.ones([m, V]).astype('int') * -1
alpha[:] = psi[0, 0, :] # assume psi[0, 0, :] := psi(*,s,1)
for t in range(1, m):
next_alpha = np.zeros([V])
for i in range(V):
tmp = alpha * psi[t, :, i]
next_alpha[i] = np.max(tmp)
trans[t, i] = np.argmax(tmp)
alpha = next_alpha
end = np.argmax(alpha)
path = [end]
for t in range(m - 1, 0, -1):
cur = path[-1]
pre = trans[t, cur]
path.append(pre)
return path[::-1]
def viterbi_2(psi):
m, V, _ = psi.shape
alpha = np.zeros([m, V])
alpha[0] = psi[0, 0, :] # assume psi[0, 0, :] := psi(*,s,1)
for t in range(1, m):
for i in range(V):
tmp = alpha[t - 1, :] * psi[t, :, i]
alpha[t, i] = np.max(tmp)
end = np.argmax(alpha[-1])
path = [end]
for t in range(m - 1, 0, -1):
cur = path[-1]
pre = np.argmax(alpha[t - 1] * psi[t, :, cur])
path.append(pre)
return path[::-1]
np.random.seed(1111)
V, m = 5, 10
log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi) # nonnegative
path_1 = viterbi_1(psi)
path_2 = viterbi_2(psi)
print(path_1)
print(path_2)
[1, 4, 2, 4, 3, 0, 3, 0, 3, 1]
[1, 4, 2, 4, 3, 0, 3, 0, 3, 1]
To train the CRF, we need to compute the corresponding gradients. Computing the gradients by hand (which also opens the door to later optimizations) requires the backward algorithm.
Define:

$$\beta(i,s)=\sum_{s_{i+1},\ldots,s_m}\psi(s,s_{i+1},i+1)\,\psi(s_{i+1},s_{i+2},i+2)\cdots\psi(s_{m-1},s_m,m)$$

i.e. the total potential of the suffixes that continue from state $s$ at position $i$, where we set $\beta(m,s)=1$.
One may imagine a special symbol at the end of the sequence. For simplicity we do not treat the end boundary specially; it can be handled analogously to the forward boundary (see the implementation).
From the definition, we have the following recurrence:

$$\beta(i,s)=\sum_{s'\in S}\psi(s,s',i+1)\,\beta(i+1,s')$$
def backward(psi):
m, V, _ = psi.shape
beta = np.zeros([m, V])
beta[-1] = 1
for t in range(m - 2, -1, -1):
for i in range(V):
'''
for k in range(V):
beta[t, i] += beta[t + 1, k] * psi[t + 1, i, k]
'''
beta[t, i] = np.sum(beta[t + 1, :] * psi[t + 1, i, :])
return beta
np.random.seed(1111)
V, m = 5, 10
log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi) # nonnegative
seq = np.random.choice(V, m)
beta = backward(psi)
print(beta)
[[ 2.95024144e+08 2.61620644e+08 3.16953747e+08 2.02959597e+08
2.51250862e+08]
[ 2.73494359e+07 2.31521489e+07 3.62404054e+07 2.84752625e+07
3.38820012e+07]
[ 2.92799244e+06 3.00539203e+06 4.18174216e+06 3.30814155e+06
3.45104724e+06]
[ 4.40588351e+05 4.18060894e+05 3.95721271e+05 4.50117410e+05
4.38635065e+05]
[ 4.51172884e+04 5.40496888e+04 4.37931199e+04 4.98898498e+04
5.04357771e+04]
[ 6.50740169e+03 5.21859026e+03 5.66773856e+03 4.73895449e+03
5.79578682e+03]
[ 4.83173340e+02 5.36538120e+02 6.01820173e+02 7.07538756e+02
6.54966046e+02]
[ 7.60936291e+01 7.90609361e+01 9.08681883e+01 5.80503199e+01
5.89976569e+01]
[ 8.15414542e+00 7.95904764e+00 9.64664115e+00 8.69502743e+00
9.41073532e+00]
[ 1.00000000e+00 1.00000000e+00 1.00000000e+00 1.00000000e+00
1.00000000e+00]]
For the boundary case $i=1$:

$$Z=\sum_{s\in S}\psi(*,s,1)\,\beta(1,s)$$
For a path $(s_1,\cdots,s_m)$, write the numerator as $\prod_{i=1}^{m}\psi(s_{i-1},s_i,i)=M$, so that $p=M/Z$; then:

$$\frac{\partial\ln p(s_1,\ldots,s_m)}{\partial\psi(a,b,i)}=\frac{\delta_{s_{i-1}=a,\;s_i=b}}{\psi(a,b,i)}-\frac{\alpha(i-1,a)\,\beta(i,b)}{Z}$$

where $\delta_{\text{true}}=1,\ \delta_{\text{false}}=0$.
def gradient(seq, psi):
m, V, _ = psi.shape
grad = np.zeros_like(psi)
alpha = forward(psi)
beta = backward(psi)
Z = np.sum(alpha[-1])
for t in range(1, m):
for i in range(V):
for j in range(V):
grad[t, i, j] = -alpha[t - 1, i] * beta[t, j] / Z
if i == seq[t - 1] and j == seq[t]:
grad[t, i, j] += 1. / psi[t, i, j]
# corner cases
grad[0, 0, :] = -beta[0, :] / Z
grad[0, 0, seq[0]] += 1. / psi[0, 0, seq[0]]
return grad
np.random.seed(1111)
V, m = 5, 10
log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi) # nonnegative
seq = np.random.choice(V, m)
grad = gradient(seq, psi)
print(grad[0, :, :])
[[ 0.75834232 -0.13348772 -0.16172055 -0.10355687 -0.12819671]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]]
def check_grad(seq, psi, i, j, k, toleration=1e-5, delta=1e-10):
m, V, _ = psi.shape
grad_1 = gradient(seq, psi)[i, j, k]
original = psi[i, j, k]
# p1
psi[i, j, k] = original - delta
p1 = np.log(pro(seq, psi))
# p2
psi[i, j, k] = original + delta
p2 = np.log(pro(seq, psi))
psi[i, j, k] = original
grad_2 = (p2 - p1) / (2 * delta)
diff = np.abs(grad_1 - grad_2)
if diff > toleration:
print("%d, %d, %d, %.2e, %.2e, %.2e" % (i, j, k, grad_1, grad_2, diff))
np.random.seed(1111)
V, m = 5, 10
log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi) # nonnegative
seq = np.random.choice(V, m)
print(seq)
for toleration in [1e-4, 5e-5, 1.5e-5]:
print(toleration)
for i in range(m):
for j in range(V):
for k in range(V):
check_grad(seq, psi, i, j, k, toleration)
[0 1 4 1 3 0 0 3 3 1]
0.0001
5e-05
1.5e-05
2, 1, 2, -2.22e-02, -2.22e-02, 1.55e-05
4, 3, 3, -2.03e-02, -2.03e-02, 1.55e-05
First, define the basic log-domain addition (log-sum-exp) operation.
ninf = -np.inf
def _logsumexp(a, b):
'''
np.log(np.exp(a) + np.exp(b))
'''
if a < b:
a, b = b, a
if b == ninf:
return a
else:
return a + np.log(1 + np.exp(b - a))
def logsumexp(*args):
'''
from scipy.special import logsumexp
logsumexp(args)
'''
res = args[0]
for e in args[1:]:
res = _logsumexp(res, e)
return res
def forward_log(log_psi):
m, V, _ = log_psi.shape
log_alpha = np.ones([m, V]) * ninf
log_alpha[0] = log_psi[0, 0, :] # assume psi[0, 0, :] := psi(*,s,1)
for t in range(1, m):
for i in range(V):
for j in range(V):
log_alpha[t, j] = logsumexp(log_alpha[t, j], log_alpha[t - 1, i] + log_psi[t, i, j])
return log_alpha
def pro_log(seq, log_psi):
m, V, _ = log_psi.shape
log_alpha = forward_log(log_psi)
log_Z = logsumexp(*[e for e in log_alpha[-1]])
log_M = log_psi[0, 0, seq[0]]
for i in range(1, m):
log_M = log_M + log_psi[i, seq[i - 1], seq[i]]
log_p = log_M - log_Z
return log_p
np.random.seed(1111)
V, m = 5, 10
log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi) # nonnegative
seq = np.random.choice(V, m)
alpha = forward(psi)
log_alpha = forward_log(log_psi)
print(np.sum(np.abs(np.log(alpha) - log_alpha)))
p = pro(seq, psi)
log_p = pro_log(seq, log_psi)
print(np.sum(np.abs(np.log(p) - log_p)))
3.03719722983e-14
0.0
def backward_log(log_psi):
m, V, _ = log_psi.shape
log_beta = np.ones([m, V]) * ninf
log_beta[-1] = 0
for t in range(m - 2, -1, -1):
for i in range(V):
for j in range(V):
log_beta[t, i] = logsumexp(log_beta[t, i], log_beta[t + 1, j] + log_psi[t + 1, i, j])
return log_beta
np.random.seed(1111)
V, m = 5, 10
log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi) # nonnegative
seq = np.random.choice(V, m)
beta = backward(psi)
log_beta = backward_log(log_psi)
print(np.sum(np.abs(beta - np.exp(log_beta))))
print(np.sum(np.abs(log_beta - np.log(beta))))
1.46851337579e-06
1.86517468137e-14
def gradient_log(seq, log_psi):
m, V, _ = log_psi.shape
grad = np.zeros_like(log_psi)
log_alpha = forward_log(log_psi)
log_beta = backward_log(log_psi)
log_Z = logsumexp(*[e for e in log_alpha[-1]])
for t in range(1, m):
for i in range(V):
for j in range(V):
grad[t, i, j] -= np.exp(log_alpha[t - 1, i] + log_beta[t, j] - log_Z)
if i == seq[t - 1] and j == seq[t]:
grad[t, i, j] += np.exp(-log_psi[t, i, j])
# corner cases
grad[0, 0, :] -= np.exp(log_beta[0, :] - log_Z)
grad[0, 0, seq[0]] += np.exp(-log_psi[0, 0, seq[0]])
return grad
np.random.seed(1111)
V, m = 5, 10
log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi) # nonnegative
seq = np.random.choice(V, m)
grad_1 = gradient(seq, psi)
grad_2 = gradient_log(seq, log_psi)
print(grad_1[0, :, :])
print(grad_2[0, :, :])
print(np.sum(np.abs(grad_1 - grad_2)))
[[ 0.75834232 -0.13348772 -0.16172055 -0.10355687 -0.12819671]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]]
[[ 0.75834232 -0.13348772 -0.16172055 -0.10355687 -0.12819671]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]
[ 0. 0. 0. 0. 0. ]]
1.11508025036e-14
In the log domain, we usually compute the gradient of the objective directly with respect to $\ln\psi$; the formula is:

$$\frac{\partial\ln p}{\partial\ln\psi(a,b,i)}=\psi(a,b,i)\,\frac{\partial\ln p}{\partial\psi(a,b,i)}=\delta_{s_{i-1}=a,\;s_i=b}-\frac{\alpha(i-1,a)\,\psi(a,b,i)\,\beta(i,b)}{Z}$$

Only a slight modification of the gradient_log above is needed, so we do not spell it out here.
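For completeness, a minimal sketch of that modification might look as follows (our own naming; it reuses forward_log, backward_log and logsumexp from above, and should agree with np.exp(log_psi) * gradient_log(seq, log_psi)):

def gradient_wrt_log_psi(seq, log_psi):
    # d ln p / d ln psi(i, j, t) = delta(...) - alpha(t-1, i) * psi(t, i, j) * beta(t, j) / Z
    m, V, _ = log_psi.shape
    grad = np.zeros_like(log_psi)
    log_alpha = forward_log(log_psi)
    log_beta = backward_log(log_psi)
    log_Z = logsumexp(*[e for e in log_alpha[-1]])
    for t in range(1, m):
        for i in range(V):
            for j in range(V):
                grad[t, i, j] -= np.exp(log_alpha[t - 1, i] + log_psi[t, i, j] + log_beta[t, j] - log_Z)
                if i == seq[t - 1] and j == seq[t]:
                    grad[t, i, j] += 1.
    # boundary position t = 0 (alpha(0, *) is implicitly 1)
    grad[0, 0, :] -= np.exp(log_psi[0, 0, :] + log_beta[0, :] - log_Z)
    grad[0, 0, seq[0]] += 1.
    return grad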
So far we have assumed that the potential function is already given and derived the CRF computations on top of it. In principle, apart from the non-negativity requirement, the potential function can be chosen flexibly. To make computation and training convenient, the CRF usually adopts an exponential form. Given the input $x_1,\ldots,x_m$, the potential is defined as:

$$\psi(s',s,i)=\exp\big(w\cdot\phi(x_1,\ldots,x_m,s',s,i)\big)$$

Then

$$p(s_1,\ldots,s_m\mid x_1,\ldots,x_m)=\frac{\exp\big(\sum_{i=1}^{m}w\cdot\phi(x_1,\ldots,x_m,s_{i-1},s_i,i)\big)}{Z}$$

where $\phi(x_1,\ldots,x_m,s',s,i)\in\mathbb{R}^d$ is the feature vector and $w\in\mathbb{R}^d$ is the parameter vector.
For the linear-chain model, we simplify the potential function to a product of a transition part and an emission part:

$$\psi(s',s,i)=\psi_T(s',s)\cdot\psi_E(x_1,\ldots,x_m,s,i)$$

The transition potential is defined as:

$$\psi_T(s',s)=\exp\big(P_{s',s}\big)$$

The emission potential is defined as:

$$\psi_E(x_1,\ldots,x_m,s,i)=\exp\big(U(x_1,\ldots,x_m,s,i)\big)$$

Then:

$$\psi(s',s,i)=\exp\big(P_{s',s}+U(x_1,\ldots,x_m,s,i)\big)$$

If we take the logarithm, we obtain a linear model. Define:

$$\log\psi(s',s,i)=P_{s',s}+U(x_1,\ldots,x_m,s,i)$$

Then

$$\ln p(s_1,\ldots,s_m\mid x_1,\ldots,x_m)=\sum_{i=1}^{m}\log\psi(s_{i-1},s_i,i)-\ln Z$$

Concretely, one can define the emission score as a linear function of the input. If $x=(x_1,\cdots,x_m)\in\mathbb{R}^m$, then:

$$U(x_1,\ldots,x_m,s,i)=x_i\,W_s,\qquad \log\psi(*,s,1)=S_s+x_1\,W_s$$

Here, for simplicity, we let $x_i$ be a scalar; in practice $x_i$ is usually a vector.
Everything from $x$ to $\log\psi$ to $\psi$ consists of differentiable operations (arithmetic, exponentials and logarithms), and we have already derived the gradient with respect to $\psi$ above. Therefore we can back-propagate the error to compute the gradients of $W$ and the other parameters, and use SGD or another optimizer to train the parameters of the whole model, the CRF included.
def score(seq, x, W, P, S):
m = len(seq)
V = len(W)
log_psi = np.zeros([m, V, V])
# corner cases
for i in range(V):
        # transmit (from the start symbol *)
log_psi[0, 0, i] += S[i]
        # emit
log_psi[0, 0, i] += x[0] * W[i]
for t in range(1, m):
for i in range(V):
for j in range(V):
# emit
log_psi[t, i, j] += x[t] * W[j]
# transmit
log_psi[t, i, j] += P[i, j]
return log_psi
def gradient_param(seq, x, W, P, S):
m = len(seq)
V = len(W)
log_psi = score(seq, x, W, P, S)
grad_psi = gradient_log(seq, log_psi)
grad_log_psi = np.exp(log_psi) * grad_psi
grad_x = np.zeros_like(x)
grad_W = np.zeros_like(W)
grad_P = np.zeros_like(P)
grad_S = np.zeros_like(S)
# corner cases
for i in range(V):
        # transmit (from the start symbol *)
grad_S[i] += grad_log_psi[0, 0, i]
        # emit
grad_W[i] += grad_log_psi[0, 0, i] * x[0]
grad_x[0] += grad_log_psi[0, 0, i] * W[i]
for t in range(1, m):
for i in range(V):
for j in range(V):
# emit
grad_W[j] += grad_log_psi[t, i, j] * x[t]
grad_x[t] += grad_log_psi[t, i, j] * W[j]
# transmit
grad_P[i, j] += grad_log_psi[t, i, j]
return grad_x, grad_W, grad_P, grad_S
np.random.seed(1111)
V, m = 5, 7
seq = np.random.choice(V, m)
x = np.random.random(m)
W = np.random.random(V)
P = np.random.random([V, V])
S = np.random.random(V)
grad_x, grad_W, grad_P, grad_S = gradient_param(seq, x, W, P, S)
print(grad_x)
print(grad_W)
print(grad_P)
print(grad_S)
[ 0.03394788 -0.11666261 0.02592661 0.07931277 0.02549323 0.11371901
0.02198856]
[-0.62291675 -0.38050215 -0.18983737 -0.65300231 1.84625859]
[[-0.34655117 -0.27314013 -0.16800195 -0.28352514 0.73359469]
[-0.22747135 -0.2967193 -0.27009443 -0.2664594 0.87349324]
[-0.27906702 -0.27747362 -0.33689934 -0.18786182 0.82788735]
[-0.2701056 -0.16940564 -0.2624276 -0.29133856 -0.25558298]
[ 0.72105085 0.86080584 0.76931185 -0.2103895 -0.11362927]]
[-0.17736447 -0.21489701 -0.20747999 -0.19735031 0.79709179]
The gradient check is as follows:
def check_grad(seq, x, W, P, S, toleration=1e-5, delta=1e-10):
    V = len(W)
grad_x, grad_W, grad_P, grad_S = gradient_param(seq, x, W, P, S)
def llk(seq, x, W, P, S):
log_psi = score(seq, x, W, P, S)
        psi = np.exp(log_psi)
        log_p = np.log(pro(seq, psi))
return log_p
# grad_x
print('Check X')
for i in range(len(x)):
original = x[i]
grad_1 = grad_x[i]
# p1
x[i] = original - delta
p1 = llk(seq, x, W, P, S)
# p2
x[i] = original + delta
p2 = llk(seq, x, W, P, S)
x[i] = original
grad_2 = (p2 - p1) / (2 * delta)
diff = np.abs(grad_1 - grad_2) / np.abs(grad_2)
if diff > toleration:
print("%d, %.2e, %.2e, %.2e" % (i, grad_1, grad_2, diff))
# grad_W
print('Check W')
for i in range(len(W)):
original = W[i]
grad_1 = grad_W[i]
# p1
W[i] = original - delta
p1 = llk(seq, x, W, P, S)
# p2
W[i] = original + delta
p2 = llk(seq, x, W, P, S)
W[i] = original
grad_2 = (p2 - p1) / (2 * delta)
diff = np.abs(grad_1 - grad_2) / np.abs(grad_2)
if diff > toleration:
print("%d, %.2e, %.2e, %.2e" % (i, grad_1, grad_2, diff))
# grad_P
print('Check P')
for i in range(V):
for j in range(V):
original = P[i][j]
grad_1 = grad_P[i][j]
# p1
P[i][j] = original - delta
p1 = llk(seq, x, W, P, S)
# p2
P[i][j] = original + delta
p2 = llk(seq, x, W, P, S)
P[i][j] = original
grad_2 = (p2 - p1) / (2 * delta)
diff = np.abs(grad_1 - grad_2) / np.abs(grad_2)
if diff > toleration:
print("%d, %.2e, %.2e, %.2e" % (i, grad_1, grad_2, diff))
# grad_S
print('Check S')
for i in range(len(S)):
original = S[i]
grad_1 = grad_S[i]
# p1
S[i] = original - delta
p1 = llk(seq, x, W, P, S)
# p2
S[i] = original + delta
p2 = llk(seq, x, W, P, S)
S[i] = original
grad_2 = (p2 - p1) / (2 * delta)
diff = np.abs(grad_1 - grad_2) / np.abs(grad_2)
if diff > toleration:
print("%d, %.2e, %.2e, %.2e" % (i, grad_1, grad_2, diff))
np.random.seed(1111)
V, m = 5, 10
seq = np.random.choice(V, m)
x = np.random.random(m)
W = np.random.random(V)
P = np.random.random([V, V])
S = np.random.random(V)
check_grad(seq, x, W, P, S)
Check X
1, 6.75e-02, 6.75e-02, 5.74e-05
2, 5.14e-01, 5.14e-01, 1.47e-05
3, -3.17e-01, -3.17e-01, 1.51e-05
5, -6.42e-02, -6.42e-02, 7.82e-05
8, -4.38e-02, -4.38e-02, 1.08e-04
Check W
0, -6.55e-01, -6.55e-01, 1.13e-05
2, -1.33e-03, -1.33e-03, 3.77e-04
3, 5.88e-02, 5.89e-02, 1.15e-04
Check P
0, -4.50e-01, -4.51e-01, 1.03e-05
0, -2.70e-01, -2.70e-01, 2.53e-05
1, -2.11e-01, -2.11e-01, 3.13e-05
1, -2.35e-01, -2.35e-01, 1.80e-05
2, -2.93e-01, -2.93e-01, 1.76e-05
2, -1.50e-01, -1.50e-01, 2.15e-05
2, -1.72e-01, -1.72e-01, 3.40e-05
2, -3.48e-01, -3.48e-01, 1.02e-05
3, -1.90e-01, -1.90e-01, 3.10e-05
3, -3.60e-01, -3.60e-01, 1.78e-05
4, 5.47e-01, 5.47e-01, 1.50e-05
Check S
0, -2.02e-01, -2.02e-01, 2.13e-05
1, -1.97e-01, -1.97e-01, 1.82e-05
2, -1.05e-01, -1.05e-01, 6.22e-05
The CRF is a powerful criterion for sequence learning. Combined with the feature-representation and learning power of bidirectional recurrent networks (e.g. Bi-LSTM), it has produced leading results on many sequence-learning tasks [5~7].
The basic model is as follows:
Figure 4. Bi-LSTM CRF model【src】
The Bi-LSTM performs feature extraction over the whole input sequence and models the emission scores with a non-linear model; the transition scores are represented separately by $P$, which is a parameter of the CRF itself. Compared with the usual objective functions used to train neural networks, the CRF is a loss function that carries its own parameters.
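To make this division of labor concrete, here is a minimal sketch (our own simplified code, not the repo's CRFLoss; the class and parameter names are made up) of a Bi-LSTM producing the emission scores while the transition matrix P and the start scores S live inside the CRF loss, reusing the log-domain forward recursion from above in torch:

import torch
import torch.nn as nn

class BiLSTMCRF(nn.Module):
    """Sketch: Bi-LSTM emissions + CRF transition/start parameters (single sequence)."""
    def __init__(self, input_dim, hidden_dim, num_tags):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.emit = nn.Linear(2 * hidden_dim, num_tags)          # non-linear emission scores
        self.P = nn.Parameter(torch.zeros(num_tags, num_tags))   # transition scores (CRF parameter)
        self.S = nn.Parameter(torch.zeros(num_tags))             # start scores (CRF parameter)

    def neg_log_likelihood(self, x, seq):
        # x: (1, m, input_dim), seq: (m,) gold tag indices
        h, _ = self.lstm(x)
        emit = self.emit(h)[0]                                   # (m, num_tags)
        m = emit.shape[0]
        # numerator: score of the gold path
        score = self.S[seq[0]] + emit[0, seq[0]]
        for t in range(1, m):
            score = score + self.P[seq[t - 1], seq[t]] + emit[t, seq[t]]
        # denominator: log Z via the forward recursion in the log domain
        log_alpha = self.S + emit[0]
        for t in range(1, m):
            log_alpha = torch.logsumexp(log_alpha.unsqueeze(1) + self.P + emit[t], dim=0)
        log_Z = torch.logsumexp(log_alpha, dim=0)
        return log_Z - score

# usage sketch: minimize the NLL with any optimizer
model = BiLSTMCRF(input_dim=8, hidden_dim=16, num_tags=5)
x, seq = torch.randn(1, 10, 8), torch.randint(0, 5, (10,))
model.neg_log_likelihood(x, seq).backward()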
A pytorch-based CRFLoss implementation can be found in the repo and in [3, 4]; for an application of BiLSTM + CRF, see [8].