An introduction to several attention mechanisms in text classification

A core part of text classification is obtaining an accurate semantic representation of the text. Previously I simply ran an LSTM or GRU over the word vectors to obtain a text representation. While reading papers and GitHub projects, I noticed that top-conference papers usually apply an attention mechanism when building the semantic vector of a text. Below I introduce several ways attention is used to obtain the text representation in classification tasks (this post will be updated as I read more papers).

The attention mechanism in adversarialLSTM

def _attention(self, H):
        """
        Use an attention mechanism to obtain the sentence representation.
        """
        # Number of hidden units in the last LSTM layer
        hiddenSize = config.model.hiddenSizes[-1]
        
        # Initialize a trainable weight vector
        W = tf.Variable(tf.random_normal([hiddenSize], stddev=0.1))
        
        # Apply a nonlinear activation to the Bi-LSTM output
        M = tf.tanh(H)
        
        # Multiply W and M. M = [batch_size, time_step, hidden_size]; reshape it to
        # [batch_size * time_step, hidden_size] before the matmul.
        # newM = [batch_size * time_step, 1]: each time step's vector is reduced to a scalar.
        newM = tf.matmul(tf.reshape(M, [-1, hiddenSize]), tf.reshape(W, [-1, 1]))
        
        # Reshape newM to [batch_size, time_step]
        restoreM = tf.reshape(newM, [-1, config.sequenceLength])
        
        # Normalize with softmax: [batch_size, time_step]
        self.alpha = tf.nn.softmax(restoreM)
        
        # Use alpha to compute a weighted sum over H, done directly with a matmul
        r = tf.matmul(tf.transpose(H, [0, 2, 1]), tf.reshape(self.alpha, [-1, config.sequenceLength, 1]))
        
        # Squeeze the 3-D tensor down to 2-D: squeezeR = [batch_size, hidden_size]
        squeezeR = tf.squeeze(r)
        
        sentenceRepren = tf.tanh(squeezeR)
        
        # Optionally apply dropout to the attention output
        output = tf.nn.dropout(sentenceRepren, self.dropoutKeepProb)
        
        return output
  • H is the Bi-LSTM output. Going from H to M the shape is unchanged, [batch_size, time_step, hiddenSize]; W has shape [hiddenSize].
  • newM = tf.matmul(tf.reshape(M, [-1, hiddenSize]), tf.reshape(W, [-1, 1])) has shape [batch_size * time_step, 1]; restoreM then has shape [batch_size, time_step], where time_step equals config.sequenceLength.
  • After the softmax layer the shape of restoreM is unchanged. In r = tf.matmul(tf.transpose(H, [0, 2, 1]), tf.reshape(self.alpha, [-1, config.sequenceLength, 1])), a [batch_size, hiddenSize, time_step] tensor is multiplied with a [batch_size, time_step, 1] tensor, giving shape [batch_size, hiddenSize, 1].
  • Finally, [batch_size, hiddenSize, 1] is squeezed down to 2-D; after tanh and dropout we obtain the attended sentence vector with shape [batch_size, hiddenSize]. A minimal shape sketch follows below.
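To make the shape bookkeeping concrete, here is a minimal NumPy sketch of the same flow, with made-up sizes standing in for config.sequenceLength and config.model.hiddenSizes[-1]:

import numpy as np

batch_size, time_step, hidden_size = 2, 5, 8

H = np.random.randn(batch_size, time_step, hidden_size)   # Bi-LSTM output
W = np.random.randn(hidden_size)                           # trainable weight vector

M = np.tanh(H)                                                       # [batch, time_step, hidden]
scores = M.reshape(-1, hidden_size) @ W.reshape(-1, 1)               # [batch * time_step, 1]
scores = scores.reshape(batch_size, time_step)                       # [batch, time_step]

alpha = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # softmax over time

# weighted sum over time: [batch, hidden, time] x [batch, time, 1] -> [batch, hidden, 1]
r = np.matmul(H.transpose(0, 2, 1), alpha[:, :, None])
sentence = np.tanh(r.squeeze(-1))                                    # [batch, hidden]
print(sentence.shape)                                                # (2, 8)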

The attention mechanism in HiGRU

# Dot-product attention
def get_attention(q, k, v, attn_mask=None):
	"""
	:param q, k, v: (batch, seq_len, hidden_size)
	:param attn_mask: (batch, seq_len, seq_len), True at padded positions
	:return: output (batch, seq_len, hidden_size), attn (batch, seq_len, seq_len)
	"""
	attn = torch.matmul(q, k.transpose(1, 2))
	
	if attn_mask is not None:
		# Fill padded positions with a very small value so softmax gives them ~0 weight
		attn.data.masked_fill_(attn_mask, -1e10)
		
	attn = F.softmax(attn, dim=-1)
	
	output = torch.matmul(attn, v)
	
	return output, attn

# Get mask for attention
def get_attn_pad_mask(seq_q, seq_k):

	assert seq_q.dim() == 2 and seq_k.dim() == 2
	
	pad_attn_mask = torch.matmul(seq_q.unsqueeze(2).float(), seq_k.unsqueeze(1).float())
	
	pad_attn_mask = pad_attn_mask.eq(Const.PAD)  # b_size x len_q x len_k
	
	return pad_attn_mask.cuda(seq_k.device)
  • The HiGRU model introduces a mask mechanism. In text classification we usually pad short texts to a fixed length, but the padded zeros should not take part in the computation. HiGRU handles this by filling the padded positions with a very small value (-1e10), so that after the softmax layer their attention weights are effectively 0 and they do not enter the weighted sum.
  • The inputs q, k and v of get_attention are the GRU outputs, each of shape [batch_size, time_step, hidden_size].
  • q, k and v have the same shape and the same values. After transposing, k has shape [batch_size, hidden_size, time_step]; multiplying q by this transpose gives attn with shape [batch_size, time_step, time_step]. Multiplying attn with v ([batch_size, time_step, hidden_size]) yields the attended output of shape [batch_size, time_step, hidden_size]. A self-contained usage sketch follows below.
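Below is a self-contained CPU sketch that mirrors get_attention and get_attn_pad_mask, assuming the pad id Const.PAD is 0 and dropping the .cuda() call so the snippet runs anywhere; the sizes are made up for illustration:

import torch
import torch.nn.functional as F

batch_size, time_step, hidden_size = 2, 4, 6
PAD = 0

# token ids; the last position of the second sample is padding
seq = torch.tensor([[5, 3, 8, 2],
                    [7, 1, 4, PAD]])

# q = k = v: GRU outputs, [batch, time_step, hidden]
h = torch.randn(batch_size, time_step, hidden_size)

# pad mask: True wherever the query or the key position is padding
pad_mask = torch.matmul(seq.unsqueeze(2).float(), seq.unsqueeze(1).float()).eq(PAD)

attn = torch.matmul(h, h.transpose(1, 2))          # [batch, time, time]
attn = attn.masked_fill(pad_mask, -1e10)           # padded positions get a tiny value
attn = F.softmax(attn, dim=-1)                     # so they receive ~0 attention weight
output = torch.matmul(attn, h)                     # [batch, time, hidden]
print(output.shape)                                # torch.Size([2, 4, 6])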

The attention mechanism in MDRE

def luong_attention(batch_size, target, condition, target_encoder_length, hidden_dim):
    # same dim [batch, max_seq, embed]

    batch_seq_embed_target = tf.reshape(target, [batch_size, target_encoder_length, hidden_dim])

    batch_embed_given = condition
    
    batch_seq_embed_given = tf.reshape(batch_embed_given, [batch_size, hidden_dim, 1])
    
    # calculate similarity: [batch, max_seq, 1]
    dot = tf.matmul(batch_seq_embed_target, batch_seq_embed_given)
    
    # normalize the similarity scores over the time dimension
    norm_dot = tf.nn.softmax(dot, dim=1)
    
    # weighted sum by using similarity (normalized): result is [batch, hidden_dim]
    target_mul_norm = tf.multiply(batch_seq_embed_target, norm_dot)
    
    weighted_sum = tf.reduce_sum(target_mul_norm, axis=1)
    
    return weighted_sum
  • MDRE fuses speech and text data for emotion classification. The paper assumes an attention relationship between the acoustic and textual features, so attention is computed between the vector obtained by running the speech features through a GRU and the text representation obtained from another GRU.
  • target is the output of the text GRU with shape [batch_size, time_step, hidden_size]. condition is the output of the speech GRU followed by a fully connected layer that makes its last dimension equal to the text hidden_size, so condition has shape [batch_size, hidden_size].
  • dot has shape [batch_size, time_step, 1] and keeps that shape after the softmax (norm_dot). batch_seq_embed_target is then multiplied element-wise with norm_dot, giving [batch_size, time_step, hidden_size]; summing over time_step yields the attention-weighted text representation conditioned on the audio, with shape [batch_size, hidden_size]. A NumPy sketch of this shape flow follows below.
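A NumPy sketch of the same shape flow (the sizes are made up; in MDRE, target comes from the text GRU and condition from the audio GRU followed by a dense layer):

import numpy as np

batch_size, time_step, hidden_size = 2, 5, 8

target = np.random.randn(batch_size, time_step, hidden_size)      # text features
condition = np.random.randn(batch_size, hidden_size)              # audio vector

dot = np.matmul(target, condition[:, :, None])                     # [batch, time_step, 1]
weights = np.exp(dot) / np.exp(dot).sum(axis=1, keepdims=True)     # softmax over time_step
weighted_sum = (target * weights).sum(axis=1)                      # [batch, hidden_size]
print(weighted_sum.shape)                                          # (2, 8)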

The self-attention mechanism in Transformer

The Transformer obtains the semantic representation of a text without any LSTM or GRU structure. Let us walk through the attention mechanism of a Transformer text classifier with the code.
The main encode function

def encode(self, xs, training=True):
        '''
        Returns
        memory: encoder outputs. (N, T1, d_model)
        '''
        with tf.variable_scope("encoder", reuse=tf.AUTO_REUSE):
            x, seqlens, sents1 = xs

            # embedding
            enc = tf.nn.embedding_lookup(self.embeddings, x) # (N, T1, d_model)
            enc *= self.hp.d_model**0.5 # scale

            enc += positional_encoding(enc, self.hp.maxlen1)
            enc = tf.layers.dropout(enc, self.hp.dropout_rate, training=training)

            ## Blocks
            for i in range(self.hp.num_blocks):
                with tf.variable_scope("num_blocks_{}".format(i), reuse=tf.AUTO_REUSE):
                    # self-attention
                    enc = multihead_attention(queries=enc,
                                              keys=enc,
                                              values=enc,
                                              num_heads=self.hp.num_heads,
                                              dropout_rate=self.hp.dropout_rate,
                                              training=training,
                                              causality=False)
                    # feed forward
                    enc = ff(enc, num_units=[self.hp.d_ff, self.hp.d_model])
        memory = enc
        return memory, sents1
  • From the code above, the input enc is the sum of the token embeddings and the positional embeddings of the words in the text (a sketch of the sinusoidal positional encoding is given after this list). The Blocks loop runs several times, i.e. there are multiple stacked self-attention and feed-forward layers.
  • multihead_attention takes queries, keys and values, i.e. the query, key and value tensors; here they are all the same tensor, enc. The multihead_attention function is introduced next.
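The positional_encoding call itself is not shown here. The sketch below implements the standard sinusoidal encoding from "Attention Is All You Need", which this kind of implementation is commonly based on; the repo's exact function may differ in details such as how padded positions are handled:

import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    # angle = pos / 10000^(2i / d_model); even dims get sin, odd dims get cos
    pos = np.arange(max_len)[:, None]                    # [max_len, 1]
    i = np.arange(d_model)[None, :]                      # [1, d_model]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe                                            # [max_len, d_model]

print(sinusoidal_position_encoding(50, 512).shape)       # (50, 512)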

multihead_attention

def multihead_attention(queries, keys, values,
                        num_heads=8, 
                        dropout_rate=0,
                        training=True,
                        causality=False,
                        scope="multihead_attention"):
    '''Applies multihead attention. See 3.2.2
    queries: A 3d tensor with shape of [N, T_q, d_model].
    keys: A 3d tensor with shape of [N, T_k, d_model].
    values: A 3d tensor with shape of [N, T_k, d_model].
    num_heads: An int. Number of heads.
    dropout_rate: A floating point number.
    training: Boolean. Controller of mechanism for dropout.
    causality: Boolean. If true, units that reference the future are masked.
    scope: Optional scope for `variable_scope`.
        
    Returns
      A 3d tensor with shape of (N, T_q, C)  
    '''
    d_model = queries.get_shape().as_list()[-1]
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        # Linear projections
        Q = tf.layers.dense(queries, d_model, use_bias=False) # (N, T_q, d_model)
        K = tf.layers.dense(keys, d_model, use_bias=False) # (N, T_k, d_model)
        V = tf.layers.dense(values, d_model, use_bias=False) # (N, T_k, d_model)
        
        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, d_model/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, d_model/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, d_model/h)

        # Attention
        outputs = scaled_dot_product_attention(Q_, K_, V_, causality, dropout_rate, training)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 ) # (N, T_q, d_model)
              
        # Residual connection
        outputs += queries
              
        # Normalize
        outputs = ln(outputs)
 
    return outputs
  • queries, keys and values have shape [batch_size, time_steps, text_embs + position_embs]. Q, K and V are obtained by passing queries, keys and values through dense layers; note that the shape does not change across these layers.
  • With the multi-head split, Q_, K_ and V_ have shape [batch_size * num_heads, time_steps, (text_embs + position_embs) / num_heads]. They are fed into the scaled_dot_product_attention function, whose outputs also have shape [batch_size * num_heads, time_steps, (text_embs + position_embs) / num_heads] and are then restored to [batch_size, time_steps, text_embs + position_embs].
  • outputs is added to the original input through a residual connection, followed by layer normalization (ln). A shape sketch of the head split and merge follows below.
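A NumPy sketch of the split-and-concat head bookkeeping described above (the sizes are made up; d_model stands for text_embs + position_embs):

import numpy as np

batch_size, time_steps, d_model, num_heads = 2, 6, 16, 4

Q = np.random.randn(batch_size, time_steps, d_model)

# split the last dimension into heads, then stack the heads along the batch axis
Q_ = np.concatenate(np.split(Q, num_heads, axis=2), axis=0)
print(Q_.shape)                    # (batch_size * num_heads, time_steps, d_model / num_heads) -> (8, 6, 4)

# ... scaled dot-product attention runs on Q_, K_, V_ ...

# restore: split along the batch axis and concatenate the heads back on the last axis
restored = np.concatenate(np.split(Q_, num_heads, axis=0), axis=2)
print(np.allclose(restored, Q))    # True: the split/concat round-trips exactly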

scaled_dot_product_attention

def scaled_dot_product_attention(Q, K, V,
                                 causality=False, dropout_rate=0.,
                                 training=True,
                                 scope="scaled_dot_product_attention"):
    '''See 3.2.1.
    Q: Packed queries. 3d tensor. [N, T_q, d_k].
    K: Packed keys. 3d tensor. [N, T_k, d_k].
    V: Packed values. 3d tensor. [N, T_k, d_v].
    causality: If True, applies masking for future blinding
    dropout_rate: A floating point number of [0, 1].
    training: boolean for controlling droput
    scope: Optional scope for `variable_scope`.
    '''
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        d_k = Q.get_shape().as_list()[-1]

        # dot product
        outputs = tf.matmul(Q, tf.transpose(K, [0, 2, 1]))  # (N, T_q, T_k)

        # scale
        outputs /= d_k ** 0.5

        # key masking
        outputs = mask(outputs, Q, K, type="key")

        # causality or future blinding masking
        if causality:
            outputs = mask(outputs, type="future")

        # softmax
        outputs = tf.nn.softmax(outputs)
        attention = tf.transpose(outputs, [0, 2, 1])
        tf.summary.image("attention", tf.expand_dims(attention[:1], -1))

        # query masking
        outputs = mask(outputs, Q, K, type="query")

        # dropout
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=training)

        # weighted sum (context vectors)
        outputs = tf.matmul(outputs, V)  # (N, T_q, d_v)

    return outputs
  • Q, K and V enter with shape [batch_size, time_steps, text_embs + position_embs]; multiplying Q by the transpose of K gives [batch_size, time_steps, time_steps].
  • The outputs are divided by the square root of the last dimension, then masked over the keys and the queries (the shape stays [batch_size, time_steps, time_steps]); softmax turns them into attention weights, and multiplying by V gives the output of shape [batch_size, time_steps, text_embs + position_embs]. A NumPy sketch without the masking steps follows below.
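A NumPy sketch of scaled dot-product attention with the masking steps left out, just to confirm the shapes described above (the sizes are made up):

import numpy as np

N, T, d_k = 2, 5, 8          # batch, time_steps, per-head dimension
Q = np.random.randn(N, T, d_k)
K = np.random.randn(N, T, d_k)
V = np.random.randn(N, T, d_k)

scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(d_k)              # [N, T, T]
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax over keys
context = np.matmul(weights, V)                                         # [N, T, d_k]
print(context.shape)                                                    # (2, 5, 8)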

mask

def mask(inputs, queries=None, keys=None, type=None):
    """Masks paddings on keys or queries to inputs
    inputs: 3d tensor. (N, T_q, T_k)
    queries: 3d tensor. (N, T_q, d)
    keys: 3d tensor. (N, T_k, d)
    e.g.,
    >> queries = tf.constant([[[1.],
                        [2.],
                        [0.]]], tf.float32) # (1, 3, 1)
    >> keys = tf.constant([[[4.],
                     [0.]]], tf.float32)  # (1, 2, 1)
    >> inputs = tf.constant([[[4., 0.],
                               [8., 0.],
                               [0., 0.]]], tf.float32)
    >> mask(inputs, queries, keys, "key")
    array([[[ 4.0000000e+00, -4.2949673e+09],
        [ 8.0000000e+00, -4.2949673e+09],
        [ 0.0000000e+00, -4.2949673e+09]]], dtype=float32)
    >> inputs = tf.constant([[[1., 0.],
                             [1., 0.],
                              [1., 0.]]], tf.float32)
    >> mask(inputs, queries, keys, "query")
    array([[[1., 0.],
        [1., 0.],
        [0., 0.]]], dtype=float32)
    """
    padding_num = -2 ** 32 + 1
    if type in ("k", "key", "keys"):
        # Generate masks
        masks = tf.sign(tf.reduce_sum(tf.abs(keys), axis=-1))  # (N, T_k)
        masks = tf.expand_dims(masks, 1) # (N, 1, T_k)
        masks = tf.tile(masks, [1, tf.shape(queries)[1], 1])  # (N, T_q, T_k)

        # Apply masks to inputs
        paddings = tf.ones_like(inputs) * padding_num
        outputs = tf.where(tf.equal(masks, 0), paddings, inputs)  # (N, T_q, T_k)
    elif type in ("q", "query", "queries"):
        # Generate masks
        masks = tf.sign(tf.reduce_sum(tf.abs(queries), axis=-1))  # (N, T_q)
        masks = tf.expand_dims(masks, -1)  # (N, T_q, 1)
        masks = tf.tile(masks, [1, 1, tf.shape(keys)[1]])  # (N, T_q, T_k)

        # Apply masks to inputs
        outputs = inputs*masks
    elif type in ("f", "future", "right"):
        diag_vals = tf.ones_like(inputs[0, :, :])  # (T_q, T_k)
        tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()  # (T_q, T_k)
        masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(inputs)[0], 1, 1])  # (N, T_q, T_k)

        paddings = tf.ones_like(masks) * padding_num
        outputs = tf.where(tf.equal(masks, 0), paddings, inputs)
    else:
        print("Check if you entered type correctly!")


    return outputs

The mask function ensures that padded positions do not take part in the computation. When masking the keys, the positions padded with zeros are filled with a very large negative number, so that after the softmax layer their probabilities become 0 and they do not contribute to the weighted sum. When masking the queries, the corresponding rows of the attention matrix are simply zeroed out after the softmax, since padded query positions do not need a context vector. A small numeric example follows below.
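The difference can be seen in a small NumPy example (the numbers are made up): key masking pushes the padded columns to a large negative value before the softmax, while query masking zeroes out the padded rows after it:

import numpy as np

scores = np.array([[1.0, 2.0, 3.0],
                   [2.0, 0.5, 1.0],
                   [0.0, 0.0, 0.0]])        # 3 x 3 attention scores (T_q x T_k)

key_pad = np.array([1, 1, 0])               # the third key position is padding
query_pad = np.array([1, 1, 0])             # the third query position is padding

masked = np.where(key_pad[None, :] == 0, -2.0 ** 32 + 1, scores)        # key masking
weights = np.exp(masked) / np.exp(masked).sum(axis=-1, keepdims=True)   # padded key weight ~ 0
weights = weights * query_pad[:, None]      # query masking: padded query rows become 0
print(np.round(weights, 3))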

Summary

The attention mechanisms commonly seen today fall into two categories:
Category 1: attention built on top of recurrent networks, either weighted pooling over the RNN outputs (to be covered in a later update) or multiplying the outputs with their transpose, which eventually yields an attention tensor of shape [batch_size, time_steps] or [batch_size, time_steps, time_steps].
Category 2: Transformer-style models that use no recurrent network at all; the attention mechanism alone already achieves very good results.
More mechanisms will be summarized in future updates.
Related links: Transformer code
A Transformer for text classification
