【Learning Notes】Linear-Chain Conditional Random Fields (CRF): Theory and Implementation

1. Overview

A Conditional Random Field (CRF) combines probabilistic graphical models with discriminative classification, and can be used to model structured prediction problems (e.g., sequence labeling).

As shown in Figure 1, paper [1] illustrates the relationship between CRFs and other models.


Figure 1. CRF compared with other machine learning models [src]

In this post we focus on the theory and implementation of the linear-chain CRF with independent input nodes (Figure 2). By combining it with a bidirectional LSTM (Bi-LSTM), the linear-chain CRF can model the more general case shown in Figure 3, increasing the model's expressive power.

Figure 2. A simple linear-chain CRF [src]

Figure 3. A general linear-chain CRF [src]

2. The CRF Algorithm

2.1 Model Formalization

Given a sequence of length $m$ and a state set $S$, for any state sequence $(s_1, \dots, s_m),\ s_i \in S$, define its potential as:

$$\psi(s_1, \dots, s_m) = \prod_{i=1}^{m} \psi(s_{i-1}, s_i, i)$$

We define $s_0$ to be a special start symbol $*$. For all $s, s' \in S$ and $i \in \{1, \dots, m\}$, the potential function satisfies $\psi(s, s', i) \ge 0$. That is, the potential function is nonnegative: it assigns a nonnegative value to the transition from state $s$ to state $s'$ at position $i$ of the sequence.

By the factorization theorem for probabilistic graphical models [1], we have:

$$p(s_1, \dots, s_m \mid x_1, \dots, x_m) = \frac{\psi(s_1, \dots, s_m)}{\sum_{s'_1, \dots, s'_m} \psi(s'_1, \dots, s'_m)}$$

where $Z = \sum_{s'_1, \dots, s'_m} \psi(s'_1, \dots, s'_m)$ is the normalization factor (partition function).
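
To make the normalization concrete, here is a brute-force sketch of $Z$ that enumerates all $|S|^m$ state sequences. It is exponential in $m$, so it is useful only as a sanity check on tiny problems; it assumes the same (m, V, V) potential-array layout used in the implementations below, where psi[0, 0, :] holds the start potentials $\psi(*, s, 1)$.

import itertools
import numpy as np

def partition_bruteforce(psi):
    # Z = sum over all V**m state sequences of the product of potentials
    m, V, _ = psi.shape
    Z = 0.0
    for seq in itertools.product(range(V), repeat=m):
        M = psi[0, 0, seq[0]]  # psi(*, s_1, 1)
        for i in range(1, m):
            M *= psi[i, seq[i - 1], seq[i]]
        Z += M
    return Z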

Like HMMs, CRFs involve three basic problems: evaluation (computing the likelihood of a given sequence), decoding (finding the most likely sequence for a given input), and training (estimating the CRF parameters from data). Solving all three again involves the forward algorithm, the backward algorithm, and the Viterbi algorithm.

The CRF potential is analogous to a probability, only unnormalized; the CRF forward, Viterbi, and backward algorithms presented here are therefore essentially the same as their HMM counterparts.

2.2 Forward Algorithm

Define:

$$\alpha(i, s) = \sum_{s_1, \dots, s_{i-1}} \psi(s_1, \dots, s_{i-1}, s)$$

which denotes the total potential of length-$i$ subsequences ending in state $s$.

Clearly, $\alpha(1, s) = \psi(*, s, 1)$.

From the definition, we have the following recursion:

$$\alpha(i, s) = \sum_{s' \in S} \alpha(i-1, s') \times \psi(s', s, i)$$

The normalization factor can then be computed as:

$$Z = \sum_{s_1, \dots, s_m} \psi(s_1, \dots, s_m) = \sum_{s \in S} \sum_{s_1, \dots, s_{m-1}} \psi(s_1, \dots, s_{m-1}, s) = \sum_{s \in S} \alpha(m, s)$$

For a given sequence $(s_1, \dots, s_m)$, the conditional probability (likelihood) is:

$$p(s_1, \dots, s_m \mid x_1, \dots, x_m) = \frac{\prod_{i=1}^{m} \psi(s_{i-1}, s_i, i)}{\sum_{s \in S} \alpha(m, s)}$$

*The forward algorithm thus solves the evaluation problem, with $O(m \cdot |S|^2)$ time complexity and $O(m \cdot |S|)$ space.*

The likelihood computation involves only multiplications and additions, all of which are differentiable. Therefore, once the forward pass is implemented, we can train under the maximum-likelihood criterion using any library with automatic differentiation (e.g., PyTorch, TensorFlow). A PyTorch-based CRF implementation is available in the repo.
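
As a minimal sketch of that idea (assuming PyTorch; the function name crf_nll and the (m, V, V) log-potential layout are ours, not from the repo), the negative log-likelihood can be written directly with differentiable ops, and autograd handles the rest:

import torch

def crf_nll(log_psi, seq):
    # log_psi: (m, V, V) leaf tensor with requires_grad=True,
    #          log_psi[t, i, j] := log psi(s'=i, s=j, t+1)
    # seq: the m gold states; returns the negative log-likelihood
    m, V, _ = log_psi.shape
    log_alpha = log_psi[0, 0, :]       # log alpha(1, s) = log psi(*, s, 1)
    score = log_psi[0, 0, seq[0]]      # log M, accumulated along the gold path
    for t in range(1, m):
        log_alpha = torch.logsumexp(log_alpha.unsqueeze(1) + log_psi[t], dim=0)
        score = score + log_psi[t, seq[t - 1], seq[t]]
    return torch.logsumexp(log_alpha, dim=0) - score  # log Z - log M

Calling loss = crf_nll(log_psi, seq) followed by loss.backward() then populates log_psi.grad. Below, the same computation is implemented from scratch in NumPy: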

import numpy as np

def forward(psi):
    # psi: (m, V, V) array of potentials, psi[t, i, j] := psi(s'=i, s=j, t+1)
    m, V, _ = psi.shape

    alpha = np.zeros([m, V])
    alpha[0] = psi[0, 0, :]  # assume psi[0, 0, :] := psi(*,s,1)

    for t in range(1, m):
        for i in range(V):
            '''
            for k in range(V):
                alpha[t, i] += alpha[t - 1, k] * psi[t, k, i]
            '''
            alpha[t, i] = np.sum(alpha[t - 1, :] * psi[t, :, i])

    return alpha

def pro(seq, psi):
    # likelihood of seq: unnormalized path potential M divided by Z
    m, V, _ = psi.shape
    alpha = forward(psi)

    Z = np.sum(alpha[-1])  # Z = sum_s alpha(m, s)
    M = psi[0, 0, seq[0]]
    for i in range(1, m):
        M *= psi[i, seq[i-1], seq[i]]

    p = M / Z
    return p

np.random.seed(1111)
V, m = 5, 10

log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi)  # nonnegative
seq = np.random.choice(V, m)

alpha = forward(psi)
p = pro(seq, psi)
print(p)
print(alpha)
2.69869828108e-08
[[  1.10026295e+00   2.52187760e+00   1.40997704e+00   1.36407554e+00
    1.00201186e+00]
 [  1.27679086e+01   1.03890052e+01   1.44699134e+01   1.15244329e+01
    1.52767179e+01]
 [  9.30306192e+01   1.09450375e+02   1.26777728e+02   1.28529576e+02
    1.16835669e+02]
 [  9.81861108e+02   8.70384204e+02   9.35531558e+02   7.98228277e+02
    9.89225754e+02]
 [  6.89790063e+03   8.71016058e+03   8.84778486e+03   9.21051594e+03
    6.56093883e+03]
 [  7.56109978e+04   7.00773298e+04   8.60611103e+04   5.63567069e+04
    5.99238226e+04]
 [  6.69236243e+05   6.42107210e+05   7.81638452e+05   6.32533145e+05
    5.71122492e+05]
 [  6.62242340e+06   5.24446290e+06   5.54750409e+06   4.68782248e+06
    4.49353155e+06]
 [  4.31080734e+07   4.09579660e+07   4.62891972e+07   4.60100937e+07
    4.63083098e+07]
 [  2.66620185e+08   4.91942550e+08   4.48597546e+08   3.42214705e+08
    4.10510463e+08]]

2.3 Viterbi Decoding

Viterbi decoding uses dynamic programming to find the most likely sequence. It closely mirrors the forward algorithm, with summation replaced by maximization:

$$\alpha(j, s) = \max_{s_1, \dots, s_{j-1}} \psi(s_1, \dots, s_{j-1}, s)$$

Clearly, $\alpha(1, s) = \psi(*, s, 1)$.

From the definition, we have the following recursion:

$$\alpha(j, s) = \max_{s' \in S}\ \alpha(j-1, s') \cdot \psi(s', s, j)$$

Among all $|S|^m$ possible sequences, the unnormalized score of the highest-probability path is:

$$\max_{s \in S} \alpha(m, s)$$

Backtracking against the direction of the forward recursion recovers the optimal path; the algorithm's complexity is $O(m \cdot |S|^2)$. A demo implementation follows:

def viterbi_1(psi):
    # Viterbi with an explicit backpointer table
    m, V, _ = psi.shape

    alpha = np.zeros([V])
    trans = np.ones([m, V]).astype('int') * -1  # trans[t, i]: best predecessor of state i at step t

    alpha[:] = psi[0, 0, :]  # assume psi[0, 0, :] := psi(*,s,1)

    for t in range(1, m):
        next_alpha = np.zeros([V])
        for i in range(V):
            tmp = alpha * psi[t, :, i]
            next_alpha[i] = np.max(tmp)
            trans[t, i] = np.argmax(tmp)
        alpha = next_alpha

    end = np.argmax(alpha)
    path = [end]
    for t in range(m - 1, 0, -1):
        cur = path[-1]
        pre = trans[t, cur]
        path.append(pre)

    return path[::-1]

def viterbi_2(psi):
    # Viterbi without backpointers: keep the full alpha table
    # and recover the argmaxes while backtracking
    m, V, _ = psi.shape

    alpha = np.zeros([m, V])
    alpha[0] = psi[0, 0, :]  # assume psi[0, 0, :] := psi(*,s,1)
    for t in range(1, m):
        for i in range(V):
            tmp = alpha[t - 1, :] * psi[t, :, i]
            alpha[t, i] = np.max(tmp)

    end = np.argmax(alpha[-1])
    path = [end]
    for t in range(m - 1, 0, -1):
        cur = path[-1]
        pre = np.argmax(alpha[t - 1] * psi[t, :, cur])
        path.append(pre)

    return path[::-1]

np.random.seed(1111)
V, m = 5, 10

log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi)  # nonnegative

path_1 = viterbi_1(psi)
path_2 = viterbi_2(psi)
print(path_1)
print(path_2)
[1, 4, 2, 4, 3, 0, 3, 0, 3, 1]
[1, 4, 2, 4, 3, 0, 3, 0, 3, 1]

2.4 Backward Algorithm

To train a CRF we need the corresponding gradients. Computing them by hand (which also opens the door to further optimizations) requires the backward algorithm.

Define:

$$\beta(j, s) = \sum_{s_{j+1}, \dots, s_m} \psi(s_{j+1}, \dots, s_m \mid s_j = s)$$

where we set $\beta(m, s) = 1$.

One may imagine a special symbol at the end of the sequence as well. For simplicity we do not treat the end boundary specially; its handling mirrors that of the forward boundary (see the implementation).

From the definition, we have the following recursion:

$$\beta(j, s) = \sum_{s' \in S} \beta(j+1, s') \cdot \psi(s, s', j+1)$$

def backward(psi):
    # beta[t, i]: total potential of all suffixes given state i at step t
    m, V, _ = psi.shape

    beta = np.zeros([m, V])
    beta[-1] = 1  # beta(m, s) = 1

    for t in range(m - 2, -1, -1):
        for i in range(V):
            '''
            for k in range(V):
                beta[t, i] += beta[t + 1, k] * psi[t + 1, i, k]
            '''
            beta[t, i] = np.sum(beta[t + 1, :] * psi[t + 1, i, :])

    return beta

np.random.seed(1111)
V, m = 5, 10

log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi)  # nonnegative
seq = np.random.choice(V, m)

beta = backward(psi)
print(beta)
[[  2.95024144e+08   2.61620644e+08   3.16953747e+08   2.02959597e+08
    2.51250862e+08]
 [  2.73494359e+07   2.31521489e+07   3.62404054e+07   2.84752625e+07
    3.38820012e+07]
 [  2.92799244e+06   3.00539203e+06   4.18174216e+06   3.30814155e+06
    3.45104724e+06]
 [  4.40588351e+05   4.18060894e+05   3.95721271e+05   4.50117410e+05
    4.38635065e+05]
 [  4.51172884e+04   5.40496888e+04   4.37931199e+04   4.98898498e+04
    5.04357771e+04]
 [  6.50740169e+03   5.21859026e+03   5.66773856e+03   4.73895449e+03
    5.79578682e+03]
 [  4.83173340e+02   5.36538120e+02   6.01820173e+02   7.07538756e+02
    6.54966046e+02]
 [  7.60936291e+01   7.90609361e+01   9.08681883e+01   5.80503199e+01
    5.89976569e+01]
 [  8.15414542e+00   7.95904764e+00   9.64664115e+00   8.69502743e+00
    9.41073532e+00]
 [  1.00000000e+00   1.00000000e+00   1.00000000e+00   1.00000000e+00
    1.00000000e+00]]

2.5 Gradient Computation

$$Z = \sum_{s_1, \dots, s_m} \psi(s_1, \dots, s_m) = \sum_{s'_{i-1} \in S,\, s'_i \in S} \sum_{s_{i-1} = s'_{i-1},\, s_i = s'_i} \psi(s_1, \dots, s_m) = \sum_{s'_{i-1} \in S,\, s'_i \in S} \alpha(i-1, s'_{i-1}) \cdot \beta(i, s'_i) \cdot \psi(s'_{i-1}, s'_i, i), \qquad 1 < i \le m$$

For the boundary case $i = 1$:

$$Z = \sum_{s'_1 \in S} \beta(1, s'_1) \cdot \psi(*, s'_1, 1)$$

For a path $(s_1, \dots, s_m)$:

$$p(s_1, \dots, s_m \mid x_1, \dots, x_m) = \frac{\psi(s_1, \dots, s_m)}{Z} = \frac{\prod_{i=1}^{m} \psi(s_{i-1}, s_i, i)}{Z} = \frac{\prod_{i=1}^{m} \psi^i_{s_{i-1}, s_i}}{Z}$$

where $\psi^i_{s', s} = \psi(s', s, i)$ for $s', s \in S$.

Write the numerator as $M = \prod_{i=1}^{m} \psi(s_{i-1}, s_i, i)$. Then:

$$\frac{\partial p(s_1, \dots, s_m \mid x_1, \dots, x_m)}{\partial \psi^k_{s', s}} = \frac{1}{Z} \left[ \frac{M}{\psi^k_{s', s}} \cdot \delta_{s' = s_{k-1},\, s = s_k} - p \cdot \alpha(k-1, s') \cdot \beta(k, s) \right]$$

where $\delta_{\text{cond}} = 1$ if the condition holds and $0$ otherwise.

$$\frac{\partial \ln p(s_1, \dots, s_m \mid x_1, \dots, x_m)}{\partial \psi^k_{s', s}} = \frac{1}{p} \cdot \frac{\partial p(s_1, \dots, s_m \mid x_1, \dots, x_m)}{\partial \psi^k_{s', s}} = \frac{\delta_{s' = s_{k-1},\, s = s_k}}{\psi^k_{s', s}} - \frac{1}{Z}\, \alpha(k-1, s') \cdot \beta(k, s)$$

def gradient(seq, psi):
    # d ln p / d psi[t, i, j] = delta(i = seq[t-1], j = seq[t]) / psi[t, i, j]
    #                           - alpha[t-1, i] * beta[t, j] / Z
    m, V, _ = psi.shape

    grad = np.zeros_like(psi)
    alpha = forward(psi)
    beta = backward(psi)

    Z = np.sum(alpha[-1])

    for t in range(1, m):
        for i in range(V):
            for j in range(V):
                grad[t, i, j] = -alpha[t - 1, i] * beta[t, j] / Z

                if i == seq[t - 1] and j == seq[t]:
                    grad[t, i, j] += 1. / psi[t, i, j]

    # corner cases
    grad[0, 0, :] = -beta[0, :] / Z
    grad[0, 0, seq[0]] += 1. / psi[0, 0, seq[0]]

    return grad

np.random.seed(1111)
V, m = 5, 10

log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi)  # nonnegative
seq = np.random.choice(V, m)

grad = gradient(seq, psi)
print(grad[0, :, :])
[[ 0.75834232 -0.13348772 -0.16172055 -0.10355687 -0.12819671]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]
def check_grad(seq, psi, i, j, k, toleration=1e-5, delta=1e-10):
    m, V, _ = psi.shape

    grad_1 = gradient(seq, psi)[i, j, k]

    original = psi[i, j, k]

    # p1
    psi[i, j, k] = original - delta
    p1 = np.log(pro(seq, psi))

    # p2
    psi[i, j, k] = original + delta
    p2 = np.log(pro(seq, psi))

    psi[i, j, k] = original
    grad_2 = (p2 - p1) / (2 * delta)

    diff = np.abs(grad_1 - grad_2)
    if diff > toleration:
        print("%d, %d, %d, %.2e, %.2e, %.2e" % (i, j, k, grad_1, grad_2, diff))

np.random.seed(1111)
V, m = 5, 10

log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi) # nonnegative
seq = np.random.choice(V, m)
print(seq)

for toleration in [1e-4, 5e-5, 1.5e-5]:
    print(toleration)
    for i in range(m):
        for j in range(V):
            for k in range(V):
                check_grad(seq, psi, i, j, k, toleration)
[0 1 4 1 3 0 0 3 3 1]
0.0001
5e-05
1.5e-05
2, 1, 2, -2.22e-02, -2.22e-02, 1.55e-05
4, 3, 3, -2.03e-02, -2.03e-02, 1.55e-05

In practice, multiplying many potentials quickly underflows or overflows, so the computation is usually carried out in the log domain. First we define a basic log-domain addition operation.

ninf = -np.inf

def _logsumexp(a, b):
    '''
    np.log(np.exp(a) + np.exp(b))

    '''

    if a < b:
        a, b = b, a

    if b == ninf:
        return a
    else:
        return a + np.log(1 + np.exp(b - a)) 

def logsumexp(*args):
    '''
    from scipy.special import logsumexp
    logsumexp(args)
    '''
    res = args[0]
    for e in args[1:]:
        res = _logsumexp(res, e)
    return res
def forward_log(log_psi):
    m, V, _ = log_psi.shape

    log_alpha = np.ones([m, V]) * ninf
    log_alpha[0] = log_psi[0, 0, :]  # assume psi[0, 0, :] := psi(*,s,1)

    for t in range(1, m):
        for i in range(V):
            for j in range(V):
                log_alpha[t, j] = logsumexp(log_alpha[t, j], log_alpha[t - 1, i] + log_psi[t, i, j])

    return log_alpha

def pro_log(seq, log_psi):
    m, V, _ = log_psi.shape
    log_alpha = forward_log(log_psi)

    log_Z = logsumexp(*[e for e in log_alpha[-1]])
    log_M = log_psi[0, 0, seq[0]]
    for i in range(1, m):
        log_M = log_M + log_psi[i, seq[i - 1], seq[i]]

    log_p = log_M - log_Z
    return log_p

np.random.seed(1111)
V, m = 5, 10

log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi)  # nonnegative
seq = np.random.choice(V, m)

alpha = forward(psi)
log_alpha = forward_log(log_psi)
print(np.sum(np.abs(np.log(alpha) - log_alpha)))

p = pro(seq, psi)
log_p = pro_log(seq, log_psi)
print(np.sum(np.abs(np.log(p) - log_p)))
3.03719722983e-14
0.0
def backward_log(log_psi):
    m, V, _ = log_psi.shape

    log_beta = np.ones([m, V]) * ninf
    log_beta[-1] = 0

    for t in range(m - 2, -1, -1):
        for i in range(V):
            for j in range(V):
                log_beta[t, i] = logsumexp(log_beta[t, i], log_beta[t + 1, j] + log_psi[t + 1, i, j])

    return log_beta

np.random.seed(1111)
V, m = 5, 10

log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi)  # nonnegative
seq = np.random.choice(V, m)

beta = backward(psi)
log_beta = backward_log(log_psi)

print(np.sum(np.abs(beta - np.exp(log_beta))))
print(np.sum(np.abs(log_beta - np.log(beta))))
1.46851337579e-06
1.86517468137e-14
def gradient_log(seq, log_psi):
    m, V, _ = log_psi.shape

    grad = np.zeros_like(log_psi)
    log_alpha = forward_log(log_psi)
    log_beta = backward_log(log_psi)

    log_Z = logsumexp(*[e for e in log_alpha[-1]])
    for t in range(1, m):
        for i in range(V):
            for j in range(V):
                grad[t, i, j] -= np.exp(log_alpha[t - 1, i] + log_beta[t, j] - log_Z)
                if i == seq[t - 1] and j == seq[t]:
                    grad[t, i, j] += np.exp(-log_psi[t, i, j])

    # corner cases
    grad[0, 0, :] -= np.exp(log_beta[0, :] - log_Z)
    grad[0, 0, seq[0]] += np.exp(-log_psi[0, 0, seq[0]])

    return grad

np.random.seed(1111)
V, m = 5, 10

log_psi = np.random.random([m, V, V])
psi = np.exp(log_psi)  # nonnegative
seq = np.random.choice(V, m)

grad_1 = gradient(seq, psi)
grad_2 = gradient_log(seq, log_psi)

print(grad_1[0, :, :])
print(grad_2[0, :, :])
print(np.sum(np.abs(grad_1 - grad_2)))
[[ 0.75834232 -0.13348772 -0.16172055 -0.10355687 -0.12819671]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]
[[ 0.75834232 -0.13348772 -0.16172055 -0.10355687 -0.12819671]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]
1.11508025036e-14

In the log domain, we usually compute the gradient of the objective directly with respect to $\ln \psi$. The formula is:

$$\frac{\partial \ln p(s_1, \dots, s_m \mid x_1, \dots, x_m)}{\partial \ln \psi^k_{s', s}} = \frac{\partial \ln p(s_1, \dots, s_m \mid x_1, \dots, x_m)}{\partial \psi^k_{s', s}} \cdot \frac{\partial \psi^k_{s', s}}{\partial \ln \psi^k_{s', s}} = \delta_{s' = s_{k-1},\, s = s_k} - \exp\left( \ln \alpha(k-1, s') + \ln \beta(k, s) - \ln Z + \ln \psi^k_{s', s} \right)$$

This requires only a small change to gradient_log above; a sketch is given below.
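
For concreteness, here is a sketch of that variant (the name gradient_log_psi is ours), reusing forward_log, backward_log, and logsumexp from above:

def gradient_log_psi(seq, log_psi):
    # d ln p / d ln psi = delta - exp(ln alpha + ln beta - ln Z + ln psi)
    m, V, _ = log_psi.shape

    grad = np.zeros_like(log_psi)
    log_alpha = forward_log(log_psi)
    log_beta = backward_log(log_psi)
    log_Z = logsumexp(*[e for e in log_alpha[-1]])

    for t in range(1, m):
        for i in range(V):
            for j in range(V):
                grad[t, i, j] -= np.exp(log_alpha[t - 1, i] + log_beta[t, j]
                                        - log_Z + log_psi[t, i, j])
                if i == seq[t - 1] and j == seq[t]:
                    grad[t, i, j] += 1.

    # corner cases
    grad[0, 0, :] -= np.exp(log_beta[0, :] - log_Z + log_psi[0, 0, :])
    grad[0, 0, seq[0]] += 1.

    return grad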

3. CRF + Neural Networks

3.1 Choosing the Potential Function

So far we have assumed the potential function is known and derived the CRF computations on that basis. In theory, apart from the nonnegativity requirement, the potential function can be chosen freely. For ease of computation and training, CRFs usually adopt an exponential form. Given inputs $x_1, \dots, x_m$, the potential function is defined as:

$$\psi(s', s, i) = \exp\left( w \cdot \phi(x_1, \dots, x_m, s', s, i) \right)$$


$$\psi(s_1, \dots, s_m) = \prod_{i=1}^{m} \psi(s_{i-1}, s_i, i) = \prod_{i=1}^{m} \exp\left( w \cdot \phi(x_1, \dots, x_m, s_{i-1}, s_i, i) \right)$$

where $\phi(x_1, \dots, x_m, s', s, i) \in \mathbb{R}^d$ is a feature vector and $w \in \mathbb{R}^d$ is a parameter vector.
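
For illustration, a hypothetical indicator-style feature map (not from [1]; real feature templates are task-specific) might concatenate a transition indicator and an emission indicator, so that $w \cdot \phi$ decomposes into exactly the transition and emission scores introduced next:

def phi(x, s_prev, s, i, n_states, n_vocab):
    # x: list of token ids; s_prev, s: state indices; i: 0-based position
    d = n_states * n_states + n_states * n_vocab
    f = np.zeros(d)
    f[s_prev * n_states + s] = 1.0                      # transition indicator
    f[n_states * n_states + s * n_vocab + x[i]] = 1.0   # emission indicator
    return f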

For the linear-chain model, the potential function simplifies to:

$$\psi(s', s, i) = t(s \mid s') \cdot e(s \mid x_i)$$

The transition potential is defined as:

$$t(s \mid s') = \exp\left( v \cdot g(s', s) \right)$$

The emission potential is defined as:

$$e(s \mid x_i) = \exp\left( w \cdot f(s, x_i) \right)$$

Then:

$$\psi(s_1, \dots, s_m) = \prod_{j=1}^{m} \psi(s_{j-1}, s_j, j) = \prod_{j=1}^{m} t(s_j \mid s_{j-1})\, e(s_j \mid x_j) = \prod_{j=1}^{m} \exp\left( v \cdot g(s_{j-1}, s_j) \right) \cdot \exp\left( w \cdot f(s_j, x_j) \right)$$

$$\psi(s_1, \dots, s_m) = \exp\left( \sum_{i=1}^{m} v \cdot g(s_{i-1}, s_i) + \sum_{i=1}^{m} w \cdot f(s_i, x_i) \right)$$

Taking logarithms yields a linear model. Define:

$$\mathrm{score}_t(s \mid s') = \log t(s \mid s') = v \cdot g(s', s)$$

$$\mathrm{score}_e(s \mid x_i) = \log e(s \mid x_i) = w \cdot f(s, x_i)$$

$$\log \psi(s_1, \dots, s_m) = \sum_{i=1}^{m} v \cdot g(s_{i-1}, s_i) + \sum_{i=1}^{m} w \cdot f(s_i, x_i) = \sum_{i=1}^{m} \mathrm{score}_t(s_i \mid s_{i-1}) + \sum_{i=1}^{m} \mathrm{score}_e(s_i \mid x_i)$$

Concretely, we can define:

$$\mathrm{score}_t(s_j \mid s_i) = P_{ij}$$

where $P$ is the $|S| \times |S|$ transition matrix.

If $x = (x_1, \dots, x_m) \in \mathbb{R}^m$, then:

$$\mathrm{score}_e(s_j \mid x_i) = W_j \cdot x_i$$

where $W \in \mathbb{R}^{|S| \times n}$ is a weight matrix.

$$\log \psi(s_1, \dots, s_m) = \sum_{i=1}^{m} \mathrm{score}_t(s_i \mid s_{i-1}) + \sum_{i=1}^{m} \mathrm{score}_e(s_i \mid x_i) = \sum_{i=1}^{m} P_{s_{i-1} s_i} + \sum_{i=1}^{m} W_{s_i} \cdot x_i$$

Here, for simplicity, we let $x_i$ be a scalar; in practice $x_i$ is usually a vector.
Every step from $x$ to $\log \psi$ to $\psi$ is a differentiable operation (arithmetic, exponentials, and logarithms), and the gradient with respect to $\psi$ was derived above. We can therefore backpropagate to obtain the gradients of $W$ and the other parameters, and train the whole model, CRF included, with SGD or similar optimizers.

def score(seq, x, W, P, S):
    m = len(seq)
    V = len(W)

    log_psi = np.zeros([m, V, V])

    # corner cases
    for i in range(V):
        # transmit (start score)
        log_psi[0, 0, i] += S[i]
        # emit
        log_psi[0, 0, i] += x[0] * W[i]

    for t in range(1, m):
        for i in range(V):
            for j in range(V):
                # emit
                log_psi[t, i, j] += x[t] * W[j]
                # transmit
                log_psi[t, i, j] += P[i, j]

    return log_psi   

def gradient_param(seq, x, W, P, S):
    m = len(seq)
    V = len(W)

    log_psi = score(seq, x, W, P, S)

    grad_psi = gradient_log(seq, log_psi)        # d ln p / d psi
    grad_log_psi = np.exp(log_psi) * grad_psi    # chain rule: d ln p / d log psi

    grad_x = np.zeros_like(x)
    grad_W = np.zeros_like(W)
    grad_P = np.zeros_like(P)
    grad_S = np.zeros_like(S)

    # corner cases
    for i in range(V):
        # transmit (start score)
        grad_S[i] += grad_log_psi[0, 0, i]
        # emit
        grad_W[i] += grad_log_psi[0, 0, i] * x[0]
        grad_x[0] += grad_log_psi[0, 0, i] * W[i]

    for t in range(1, m):
        for i in range(V):
            for j in range(V):
                # emit
                grad_W[j] += grad_log_psi[t, i, j] * x[t]
                grad_x[t] += grad_log_psi[t, i, j] * W[j]
                # transmit
                grad_P[i, j] += grad_log_psi[t, i, j]

    return grad_x, grad_W, grad_P, grad_S

np.random.seed(1111)
V, m = 5, 7

seq = np.random.choice(V, m)
x = np.random.random(m)
W = np.random.random(V)
P = np.random.random([V, V])
S = np.random.random(V)

grad_x, grad_W, grad_P, grad_S = gradient_param(seq, x, W, P, S)

print(grad_x)
print(grad_W)
print(grad_P)
print(grad_S)
[ 0.03394788 -0.11666261  0.02592661  0.07931277  0.02549323  0.11371901
  0.02198856]
[-0.62291675 -0.38050215 -0.18983737 -0.65300231  1.84625859]
[[-0.34655117 -0.27314013 -0.16800195 -0.28352514  0.73359469]
 [-0.22747135 -0.2967193  -0.27009443 -0.2664594   0.87349324]
 [-0.27906702 -0.27747362 -0.33689934 -0.18786182  0.82788735]
 [-0.2701056  -0.16940564 -0.2624276  -0.29133856 -0.25558298]
 [ 0.72105085  0.86080584  0.76931185 -0.2103895  -0.11362927]]
[-0.17736447 -0.21489701 -0.20747999 -0.19735031  0.79709179]

The gradient correctness check is as follows:

def check_grad(seq, x, W, P, S, toleration=1e-5, delta=1e-10):
    m, V = len(seq), len(W)

    grad_x, grad_W, grad_P, grad_S = gradient_param(seq, x, W, P, S)

    def llk(seq, x, W, P, S):
        log_psi = score(seq, x, W, P, S)
        psi = np.exp(log_psi)
        log_p = np.log(pro(seq, psi))
        return log_p

    # grad_x
    print('Check X')
    for i in range(len(x)):
        original = x[i]
        grad_1 = grad_x[i]

        # p1
        x[i] = original - delta
        p1 = llk(seq, x, W, P, S)

        # p2
        x[i] = original + delta
        p2 = llk(seq, x, W, P, S)

        x[i] = original
        grad_2 = (p2 - p1) / (2 * delta)

        diff = np.abs(grad_1 - grad_2) / np.abs(grad_2)
        if diff > toleration:
            print("%d, %.2e, %.2e, %.2e" % (i, grad_1, grad_2, diff))

    # grad_W
    print('Check W')
    for i in range(len(W)):
        original = W[i]
        grad_1 = grad_W[i]

        # p1
        W[i] = original - delta
        p1 = llk(seq, x, W, P, S)

        # p2
        W[i] = original + delta
        p2 = llk(seq, x, W, P, S)

        W[i] = original
        grad_2 = (p2 - p1) / (2 * delta)

        diff = np.abs(grad_1 - grad_2) / np.abs(grad_2)
        if diff > toleration:
            print("%d, %.2e, %.2e, %.2e" % (i, grad_1, grad_2, diff))

    # grad_P
    print('Check P')
    for i in range(V):
        for j in range(V):
            original = P[i][j]
            grad_1 = grad_P[i][j]

            # p1
            P[i][j] = original - delta
            p1 = llk(seq, x, W, P, S)

            # p2
            P[i][j] = original + delta
            p2 = llk(seq, x, W, P, S)

            P[i][j] = original
            grad_2 = (p2 - p1) / (2 * delta)

            diff = np.abs(grad_1 - grad_2) / np.abs(grad_2)
            if diff > toleration:
                print("%d, %.2e, %.2e, %.2e" % (i, grad_1, grad_2, diff))

    # grad_S
    print('Check S')
    for i in range(len(S)):
        original = S[i]
        grad_1 = grad_S[i]

        # p1
        S[i] = original - delta
        p1 = llk(seq, x, W, P, S)

        # p2
        S[i] = original + delta
        p2 = llk(seq, x, W, P, S)

        S[i] = original
        grad_2 = (p2 - p1) / (2 * delta)

        diff = np.abs(grad_1 - grad_2) / np.abs(grad_2)
        if diff > toleration:
            print("%d, %.2e, %.2e, %.2e" % (i, grad_1, grad_2, diff))

np.random.seed(1111)
V, m = 5, 10

seq = np.random.choice(V, m)
x = np.random.random(m)
W = np.random.random(V)
P = np.random.random([V, V])
S = np.random.random(V)

check_grad(seq, x, W, P, S)
Check X
1, 6.75e-02, 6.75e-02, 5.74e-05
2, 5.14e-01, 5.14e-01, 1.47e-05
3, -3.17e-01, -3.17e-01, 1.51e-05
5, -6.42e-02, -6.42e-02, 7.82e-05
8, -4.38e-02, -4.38e-02, 1.08e-04
Check W
0, -6.55e-01, -6.55e-01, 1.13e-05
2, -1.33e-03, -1.33e-03, 3.77e-04
3, 5.88e-02, 5.89e-02, 1.15e-04
Check P
0, -4.50e-01, -4.51e-01, 1.03e-05
0, -2.70e-01, -2.70e-01, 2.53e-05
1, -2.11e-01, -2.11e-01, 3.13e-05
1, -2.35e-01, -2.35e-01, 1.80e-05
2, -2.93e-01, -2.93e-01, 1.76e-05
2, -1.50e-01, -1.50e-01, 2.15e-05
2, -1.72e-01, -1.72e-01, 3.40e-05
2, -3.48e-01, -3.48e-01, 1.02e-05
3, -1.90e-01, -1.90e-01, 3.10e-05
3, -3.60e-01, -3.60e-01, 1.78e-05
4, 5.47e-01, 5.47e-01, 1.50e-05
Check S
0, -2.02e-01, -2.02e-01, 2.13e-05
1, -1.97e-01, -1.97e-01, 1.82e-05
2, -1.05e-01, -1.05e-01, 6.22e-05

3.2 Bi-LSTM + CRF

The CRF is a powerful criterion for sequence learning. Combined with the feature representation and learning power of bidirectional recurrent networks (e.g., Bi-LSTM), it has achieved leading results on many sequence learning tasks [5-7].

The basic model is shown below:
Figure 4. Bi-LSTM CRF model [src]

The Bi-LSTM extracts features from and models the entire input sequence, modeling the emission scores nonlinearly; the transition scores are represented by a separate matrix $P$, which is a parameter of the CRF itself. Compared with the usual objective functions for training neural networks, the CRF is a loss function that carries its own parameters.
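
A minimal sketch of this architecture (class and parameter names are ours, not from the repo or [3, 4]): the Bi-LSTM produces the emission scores, while the transition matrix P and a start-score vector live inside the CRF loss itself.

import torch
import torch.nn as nn

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, num_tags):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)          # emission scores
        self.P = nn.Parameter(torch.randn(num_tags, num_tags))   # transition scores
        self.start = nn.Parameter(torch.randn(num_tags))         # start scores (S above)

    def neg_log_likelihood(self, tokens, tags):
        # tokens: (1, m) LongTensor; tags: (m,) LongTensor of gold states
        h, _ = self.lstm(self.emb(tokens))
        emit = self.proj(h)[0]                                   # (m, num_tags)
        m = emit.shape[0]
        log_alpha = self.start + emit[0]
        score = self.start[tags[0]] + emit[0, tags[0]]
        for t in range(1, m):
            log_alpha = torch.logsumexp(log_alpha.unsqueeze(1) + self.P, dim=0) + emit[t]
            score = score + self.P[tags[t - 1], tags[t]] + emit[t, tags[t]]
        return torch.logsumexp(log_alpha, dim=0) - score         # log Z - log M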

A PyTorch-based CRFLoss implementation can be found in the repo and in [3, 4]; for a BiLSTM + CRF implementation and applications, see [8].

Discussion

  1. CRFs are widely used for sequence labeling tasks.
  2. Being a discriminative model, a CRF can be more effective than an HMM on classification tasks.
  3. A CRF models the dependencies between outputs, unlike plain RNN or CTC models (correspondingly, CRF training and prediction are also more computationally expensive).

References

  1. Sutton and McCallum. An Introduction to Conditional Random Fields.
  2. Michael Collins. The Forward-Backward Algorithm.
  3. Pytorch CRF Forward and Viterbi Implementation.
  4. BiLSTM-CRF on PyTorch.
  5. Collobert. Deep Learning for Efficient Discriminative Parsing.
  6. Collobert et al. Natural Language Processing (Almost) from Scratch.
  7. Huang et al. Bidirectional LSTM-CRF Models for Sequence Tagging.
  8. Bi-LSTM-CRF for NLP.
