Transformer Net

1. What the Transformer can do

The paper "Attention Is All You Need" was published in June 2017.

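Advantages of the Transformer: it is built on self-attention, which has strengths that neither CNNs nor LSTMs have. It sees wider and farther than a CNN, because every position can attend to every other position within a single layer, and it trains faster than an LSTM, because all positions are processed in parallel. Stacking several layers of multi-head self-attention keeps being shown to have strong expressive power.

Its role is similar to that of an LSTM: it is generally used as a feature encoder.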

2. PyTorch already implements this model

/Anaconda3/Lib/site-packages/torch/nn/modules/transformer.py 

The structure is clear:

class Transformer(Module):...
class TransformerEncoder(Module):...
class TransformerDecoder(Module):...
class TransformerEncoderLayer(Module):...
class TransformerDecoderLayer(Module):...

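1.1 Basic structure of the Transformer

The key parts of the source code are extracted below (abridged). The overall structure has two parts, an encoder and a decoder. The encoder is implemented by TransformerEncoder, which is a stack of TransformerEncoderLayer; the decoder is implemented by TransformerDecoder, which is a stack of TransformerDecoderLayer.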

class Transformer(Module): 

    def __init__(self, ) -> None:
        super(Transformer, self).__init__()
        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, 
                                           dropout, activation)
        encoder_norm = LayerNorm(d_model)
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, 
                                           encoder_norm)

      
        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, 
                                                       dropout, activation)
        decoder_norm = LayerNorm(d_model)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, 
                                                       decoder_norm)

        self._reset_parameters()
        self.d_model = d_model
        self.nhead = nhead

    def forward(self, ) -> Tensor:  
        memory = self.encoder(src, mask=src_mask,                 
                              src_key_padding_mask=src_key_padding_mask)
        output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
                              tgt_key_padding_mask=tgt_key_padding_mask,
                              memory_key_padding_mask=memory_key_padding_mask)
        return output
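
As a small check of this nesting, the sketch below builds a deliberately tiny nn.Transformer and inspects its encoder and decoder stacks. The sizes (d_model=32, nhead=4 and so on) are arbitrary values chosen for illustration, not values taken from the text above.

```python
import torch.nn as nn

# A tiny model; all sizes here are illustrative assumptions
model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=3, dim_feedforward=64)

# Each stack keeps its layers in a ModuleList called `layers`
encoder_layers = model.encoder.layers
decoder_layers = model.decoder.layers

print(type(model.encoder).__name__, len(encoder_layers),
      type(encoder_layers[0]).__name__)
print(type(model.decoder).__name__, len(decoder_layers),
      type(decoder_layers[0]).__name__)
```

The first line printed names TransformerEncoder with 2 layers of type TransformerEncoderLayer, and the second names TransformerDecoder with 3 layers of type TransformerDecoderLayer.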

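1.2 TransformerEncoder inputs, outputs, and parameters

A common strategy is to use TransformerEncoder as a feature extractor.

How do you extract features with TransformerEncoder?

According to the docstring, TransformerEncoder is a stack of encoder layers, so it takes three arguments: encoder_layer (an instance of TransformerEncoderLayer), num_layers (the number of encoder layers in the stack), and norm (an optional layer normalization applied to the final output).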

r"""TransformerEncoder is a stack of N encoder layers

Args:
    encoder_layer: an instance of the TransformerEncoderLayer() class (required).
    num_layers: the number of sub-encoder-layers in the encoder (required).
    norm: the layer normalization component (optional).

Examples::
    >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
    >>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
    >>> src = torch.rand(10, 32, 512)
    >>> out = transformer_encoder(src)
"""

Source code that defines the encoder inside Transformer

encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
encoder_norm = LayerNorm(d_model)
self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

Arguments of forward

Args:
    src: the sequence to the encoder (required).
    mask: the mask for the src sequence (optional).
    src_key_padding_mask: the mask for the src keys per batch (optional).

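There are two kinds of mask: src_mask and src_key_padding_mask. Here we mainly explain src_key_padding_mask. Its size must be N x S, that is, batch size x sequence length.
With this mask the padded positions are ignored, so the attention mechanism no longer takes them into account. src_key_padding_mask is a boolean tensor: True where a position should be ignored, False where the original value should be kept.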
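
The effect of the mask can be observed directly on a single attention layer. The sketch below is only an illustration: it assumes the padding id is 0 (pad_id is a name introduced here), derives the mask from the padded ids, and feeds random features to nn.MultiheadAttention to look at the attention weights.

```python
import torch
import torch.nn as nn

pad_id = 0  # assumed padding id, for illustration only
input_id = torch.tensor([[2, 3, 6, 0],
                         [7, 8, 0, 0],
                         [12, 9, 5, 67]])

# [batch, t]; True marks padded positions that attention should ignore
src_key_padding_mask = input_id.eq(pad_id)

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=4)
x = torch.rand(4, 3, 16)  # [t, batch, d_model], the default layout

# weights: [batch, t, t], averaged over the heads
_, weights = attn(x, x, x, key_padding_mask=src_key_padding_mask)

# For sample 1 the last two keys are padding, so their columns are all zero
print(weights[1])
```

Each row of the weights still sums to 1; the probability mass is simply redistributed over the non-padded keys.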

Building a feature extractor from the Examples or the source code

You need to implement the embedding, the position embedding, and the mask yourself.

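import torch
import torch.nn as nn

# One input batch containing 3 samples
input_data = [['有', '一', '个'],
              ['名', '字'],
              ['我', '不', '知', '道']]

# After the token -> id lookup, padded to the same length.
# The longest sample in the batch has 4 tokens; shape = [batch, t]
input_id = torch.tensor(
    [[2, 3, 6, 0],
     [7, 8, 0, 0],
     [12, 9, 5, 67]], dtype=torch.long)

# src_key_padding_mask derived from the padding; shape = [batch, t], True = padding
src_key_padding_mask = torch.tensor(
    [[0, 0, 0, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0]], dtype=torch.bool)

d_model = 128  # feature dimension; the Transformer keeps input and output dimensions equal
num_encoder_layers = 6  # number of layers stacked in the encoder

# Embedding: shape = [batch, t, d_model].
# The vocabulary size 100 is an assumed value; the original left em_model undefined.
em_model = nn.Embedding(num_embeddings=100, embedding_dim=d_model, padding_idx=0)
input_embedding = em_model(input_id)

# The encoder input is normally the embedding output. By default the encoder
# expects [t, batch, d_model], so swap the first two axes.
encoder_input = input_embedding.transpose(0, 1)

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder_norm = nn.LayerNorm(d_model)
encoder_model = nn.TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

# encoder_input [t, batch, d_model], mask [batch, t] -> output [t, batch, d_model]
output = encoder_model(encoder_input, src_key_padding_mask=src_key_padding_mask)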

1.3 Transformer inputs and outputs

Examples::
    >>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
    >>> src = torch.rand((10, 32, 512))
    >>> tgt = torch.rand((20, 32, 512))
    >>> out = transformer_model(src, tgt)
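
To make the shapes concrete, here is a minimal sketch with a much smaller model than the docstring example. The sizes are arbitrary assumptions; the causal mask is built by hand so that each target position can only attend to itself and earlier positions.

```python
import torch
import torch.nn as nn

# Small illustrative sizes, not values from the docstring
model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=64)

src = torch.rand(10, 2, 32)  # [src_len, batch, d_model], goes to the encoder
tgt = torch.rand(7, 2, 32)   # [tgt_len, batch, d_model], goes to the decoder

# Causal mask: -inf above the diagonal, 0 elsewhere
tgt_mask = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([7, 2, 32]), same shape as tgt
```

The output follows the length of tgt, not of src: the decoder produces one feature vector per target position.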

TODO: what is tgt? It is the input to the decoder, as the call self.decoder(tgt, memory, ...) in forward shows.

What problems can the Transformer solve, and what form does the data take?

A look at the decoder part:

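When predictions are fed back step by step, as happens at inference time (during training the ground-truth target sequence, shifted right by one position, is usually fed to the decoder all at once under a causal mask):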

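1. At the first decoder time step (the first time it receives input), the input is a special token. It may be the start-of-sequence token of the target, the end-of-sequence token of the source, or some other task-dependent input; implementations differ slightly. The goal is to predict the first word (token) of the translation;

2. Then that initial token, together with the predicted first word, is fed to the decoder again to obtain the second predicted word;

3. Later steps follow the same pattern.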
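
The steps above can be sketched as a greedy decoding loop. This is only an illustration on an untrained model, so the predicted ids are meaningless; vocab_size, bos_id, max_len, embed and to_vocab are all names and values assumed here, and positional encoding is left out for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, bos_id, max_len = 20, 32, 1, 5  # assumed values

embed = nn.Embedding(vocab_size, d_model)      # shared by source and target here
model = nn.Transformer(d_model=d_model, nhead=4, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=64, dropout=0.0)
to_vocab = nn.Linear(d_model, vocab_size)      # projects features to token scores

src_ids = torch.tensor([[3], [4], [5]])        # [src_len, batch=1]
tgt_ids = torch.tensor([[bos_id]])             # step 1: only the start token

with torch.no_grad():
    memory = model.encoder(embed(src_ids))     # encode the source once
    for _ in range(max_len):
        t = tgt_ids.size(0)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = model.decoder(embed(tgt_ids), memory, tgt_mask=causal)
        # The last position predicts the next token
        next_id = to_vocab(out[-1]).argmax(dim=-1, keepdim=True)  # [batch, 1]
        # Append the prediction and feed everything back in
        tgt_ids = torch.cat([tgt_ids, next_id.t()], dim=0)

print(tgt_ids.squeeze(1).tolist())  # start token followed by max_len predictions
```

A real decoder would also stop early once an end-of-sequence token is produced; that check is omitted here.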

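3 Implementing the submodules by hand

3.1 Positional encoding (as distinct from position embedding)

Requirements for the encoding

  1. It must distinguish the same word appearing at different positions.
  2. It must reflect order, and the difference between the encodings of positions within a given range should not depend on the length of the text, so it has some invariance.
  3. Its values must lie in a bounded range.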
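
These requirements can be checked numerically. The sketch below assumes the standard sinusoidal encoding from the paper and a helper named sinusoid introduced here for illustration; it also shows that the dot product between two encodings depends only on the distance between positions, not on which one comes first.

```python
import numpy as np


def sinusoid(max_len, dim):
    """Standard sinusoidal table of shape (max_len, dim); helper for this check."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, 2 * (i // 2) / dim)
    pe = np.zeros((max_len, dim))
    pe[:, 0::2] = np.sin(angle[:, 0::2])
    pe[:, 1::2] = np.cos(angle[:, 1::2])
    return pe


pe = sinusoid(50, 16)

# 1. Different positions get different encodings
distinct = len(np.unique(pe.round(8), axis=0)) == 50

# 2. The similarity of two positions depends only on their offset (here 5)
same_offset = np.isclose(pe[3] @ pe[8], pe[20] @ pe[25])

# 3. All values are bounded by 1 in absolute value
bounded = np.abs(pe).max() <= 1.0

# No sense of direction: offsets +3 and -3 give the same dot product
symmetric = np.isclose(pe[10] @ pe[13], pe[10] @ pe[7])

print(distinct, same_offset, bounded, symmetric)
```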

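Formula, strengths and weaknesses: it distinguishes different positions and has a bounded range, but it carries no sense of direction.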
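
For reference, the sinusoidal encoding from "Attention Is All You Need" is usually written as below, where pos is the position, i indexes the pairs of dimensions, and d_model is the encoding dimension (embed_dim in the code of this section).

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)
```

Dimensions 2i and 2i+1 share the same frequency, which is why an implementation indexed by a single dimension j uses the exponent 2 * (j // 2) / d_model.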

References: https://zhuanlan.zhihu.com/p/166244505

https://wmathor.com/index.php/archives/1438/


How is it used? The values are fixed and are not updated during training.
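
One common way to use a fixed table in PyTorch is to store it as a buffer, so it moves with the module but is never seen by the optimizer, and to add it to the token embeddings. The sketch below is an assumption about usage, not the author's method: FixedPositionalEncoding is a name introduced here, embed_dim is assumed to be even, and unlike the NumPy code in this section it does not zero out row 0.

```python
import math

import torch
import torch.nn as nn


class FixedPositionalEncoding(nn.Module):
    """Adds a non-trainable sinusoidal table to the input (hypothetical helper)."""

    def __init__(self, max_seq_len, embed_dim):
        super().__init__()
        position = torch.arange(max_seq_len).unsqueeze(1).float()
        # 10000^(-2i/embed_dim) for each pair index i
        div_term = torch.exp(torch.arange(0, embed_dim, 2).float()
                             * (-math.log(10000.0) / embed_dim))
        pe = torch.zeros(max_seq_len, embed_dim)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # A buffer is saved in state_dict but is not a parameter
        self.register_buffer("pe", pe)

    def forward(self, x):
        # x: [batch, t, embed_dim]; add the first t rows of the table
        return x + self.pe[: x.size(1)]


pos_enc = FixedPositionalEncoding(max_seq_len=100, embed_dim=16)
x = torch.zeros(2, 8, 16)  # stands in for token embeddings
out = pos_enc(x)
print(out.shape, len(list(pos_enc.parameters())))
```

Because the table is a buffer, the module has no parameters and the table receives no gradient updates.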

import numpy as np


def positional_encoding(max_seq_len, embed_dim):
    """
    Build a sinusoidal positional encoding table. It is not learnable (never updated).
    :param max_seq_len: maximum sequence length
    :param embed_dim: dimension of the positional encoding
    :return: array of shape (max_seq_len, embed_dim)
    """
    # Dimensions 2i and 2i+1 share one frequency, hence 2 * (i // 2)
    table = np.array([
        [pos / np.power(10000, 2 * (i // 2) / embed_dim) for i in range(embed_dim)]
        if pos != 0 else np.zeros(embed_dim) for pos in range(max_seq_len)])
    # Why skip the first position? Row 0 stays an all-zero vector; implementations
    # written this way usually reserve position 0 for the padding token.
    table[1:, 0::2] = np.sin(table[1:, 0::2])  # even dimensions 2i
    table[1:, 1::2] = np.cos(table[1:, 1::2])  # odd dimensions 2i+1
    return table


def get_sen_encoding(encoding, position_list):
    """Look up the encoding row for each position id."""
    return np.array([encoding[i] for i in position_list])


if __name__ == '__main__':
    pe_table = positional_encoding(max_seq_len=100, embed_dim=16)
    print(pe_table.shape)  # (100, 16)

    # TODO: not sure this is the intended usage. Since row 0 is the zero vector,
    # real tokens are given position ids starting from 1 here.
    sen = ['我', '们', '一', '起', '去', '旅', '行', '吧']  # len=8
    sen_pos_id = [1, 2, 3, 4, 5, 6, 7, 8]
    output = get_sen_encoding(pe_table, sen_pos_id)
    print(output.shape)  # (8, 16)
    # To visualise the table (requires matplotlib and seaborn):
    # import matplotlib.pyplot as plt
    # import seaborn as sns
    # plt.figure(figsize=(10, 10))
    # sns.heatmap(pe_table)
    # plt.title("Sinusoidal Function")
    # plt.xlabel("hidden dimension")
    # plt.ylabel("sequence length")
    # plt.show()

3.2 Scaled dot-product attention, multi-head attention, encoder

 https://wmathor.com/index.php/archives/1438/

https://blog.csdn.net/qq_37236745/article/details/107352273

https://blog.csdn.net/weixin_41811314/article/details/106804906  (on the difference between the two masks)


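This code is for studying the encoder part of the Transformer. Because it does not implement src_key_padding_mask, it is better to use the PyTorch source implementation in real projects.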

import torch
import torch.nn as nn
import torch.nn.functional as F
import math


# by xmm
class ScaledDotProductAttention(nn.Module):
    """
    Compute 'Scaled Dot Product Attention'
    Attention(Q,K,V) = softmax(Q*Kt/sqrt(dk)) *V
    """
    """ for test 
            q = torch.randn(4, 8, 10, 64)  # (batch, n_head, seqLen, dim)
            k = torch.randn(4, 8, 10, 64)
            v = torch.randn(4, 8, 10, 64)
            mask = torch.ones(4, 8, 10, 10)
            model = ScaledDotProductAttention()
            res = model(q, k, v, mask)
            print(res[0].shape)  # torch.Size([4, 8, 10, 64])
    """

    def forward(self, query, key, value, attn_mask=None, dropout=None):
        """
        When Q, K and V are all linear projections of the same sequence, this is self-attention;
        when Q comes from one sequence and K, V come from another, this is cross-attention.

        url:https://www.e-learn.cn/topic/3764324
        url:https://my.oschina.net/u/4228078/blog/4497939
          :param query: (batch, n_head, seqLen, dim), where n_head is the number of heads and n_head * dim = embedSize
          :param key: (batch, n_head, seqLen, dim)
          :param value: (batch, n_head, seqLen, dim)
          :param attn_mask: (batch, n_head, seqLen, seqLen); 1 marks positions to attend to, 0 marks positions to mask out
          :param dropout: optional dropout module applied to the attention weights
          """
        # (batch, n_head, seqLen, seqLen): the attention weights are L x L because every pair of tokens gets a weight
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
        if attn_mask is not None:
            scores = scores.masked_fill(attn_mask == 0, -1e9)  # where the mask is 0, use a very negative score so softmax gives ~0 weight

        p_attn = F.softmax(scores, dim=-1)

        if dropout is not None:
            p_attn = dropout(p_attn)

        return torch.matmul(p_attn, value), p_attn  # (batch, n_head, seqLen, dim)


# by xmm
class MultiHeadAttention(nn.Module):
    """
    for test :
                q = torch.randn(4, 10, 8 * 64)  # (batch, seqLen, d_model)
                k = torch.randn(4, 10, 8 * 64)
                v = torch.randn(4, 10, 8 * 64)
                mask = torch.ones(4, 10, 10)  # (batch, seqLen, seqLen); expanded per head inside forward
                model = MultiHeadAttention(h=8, d_model=8 * 64)
                res = model(q, k, v, mask)
                print(res.shape)  # torch.Size([4, 10, 512])
    """

    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadAttention, self).__init__()
        assert d_model % h == 0

        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h

        self.linear_layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
        self.output_linear = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention()

        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, attn_mask=None):
        """

        :param query: (batch,seqLen, d_model)
        :param key: (batch,seqLen, d_model)
        :param value: (batch,seqLen, d_model)
        :param attn_mask: (batch, seqLen, seqLen); 1 = attend, 0 = masked out
        :return: (batch,seqLen, d_model)
        """
        batch_size = query.size(0)

        # 1, Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linear_layers, (query, key, value))]

        # 2,Apply attention on all the projected vectors in batch.
        if attn_mask is not None:  # `if attn_mask:` raises an error for tensors with more than one element
            attn_mask = attn_mask.unsqueeze(1).repeat(1, self.h, 1, 1)  # (batch, n_head,seqLen,seqLen)
        x, atten = self.attention(query, key, value, attn_mask=attn_mask, dropout=self.dropout)

        # 3, "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.output_linear(x)


# by xmm
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, dim_feedforward, dropout, activation):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, dim_feedforward)
        self.w_2 = nn.Linear(dim_feedforward, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = activation

    def forward(self, x):
        return self.dropout(self.w_2(self.activation(self.w_1(x))))


# by xmm
class TransformerEncoderLayer(nn.Module):
    """
    Bidirectional Encoder = Transformer (self-attention)
    Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
    Example:

    """

    def __init__(self, d_model, n_head, dim_feedforward, dropout=0.1, activation="relu"):
        """

        :param d_model:
        :param n_head:
        :param dim_feedforward:
        :param dropout:
        :param activation: default :relu
        """

        super().__init__()
        self.self_attn = MultiHeadAttention(h=n_head, d_model=d_model, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        if activation == "relu":
            self.activation = F.relu
        elif activation == "gelu":
            self.activation = F.gelu
        else:
            raise ValueError(f"activation should be 'relu' or 'gelu', got {activation!r}")

        self.PositionwiseFeedForward = PositionwiseFeedForward(d_model=d_model, dim_feedforward=dim_feedforward,
                                                               dropout=dropout, activation=self.activation)

    def forward(self, x, atten_mask):
        """

        :param x: (batch, seqLen, em_dim)
        :param atten_mask: attention mask of shape (batch, seqLen, seqLen), or None
        :return:
        """
        # add & norm 1
        attn = self.dropout(self.self_attn(x, x, x, attn_mask=atten_mask))
        x = self.norm1((x + attn))

        # # add & norm 2
        x = self.norm2(x + self.PositionwiseFeedForward(x))
        return x


class TransformerEncoder(nn.Module):
    """
    Example:
        x = torch.randn(4, 10, 128)  # (batch, seqLen, em_dim)
        model = TransformerEncoder(d_model=128, n_head=8, nlayers=3)
        res = model.forward(x)
        print(res.shape)  # torch.Size([4, 10, 128])

    """

    def __init__(self, d_model, n_head, nlayers, dim_feedforward=1024, dropout=0.1, activation="relu"):
        super(TransformerEncoder, self).__init__()
        self.encoder = nn.ModuleList([TransformerEncoderLayer(d_model, n_head, dim_feedforward, dropout, activation)
                                      for _ in range(nlayers)])

    def forward(self, x, atten_mask=None):
        """

        :param x: input dim == out dim
        :param atten_mask: corresponds to src_mask in the PyTorch source; src_key_padding_mask is not implemented
        :return:
        """
        for layer in self.encoder:
            x = layer.forward(x, atten_mask)
        return x


if __name__ == '__main__':
    x = torch.randn(4, 10, 128)  # (batch, seqLen, em_dim)
    model = TransformerEncoder(d_model=128, n_head=8, nlayers=3)
    res = model.forward(x)
    print(res.shape)  # torch.Size([4, 10, 128])
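
Since the hand-written encoder in this section does not implement src_key_padding_mask, one possible workaround is to expand a key padding mask into the (batch, seqLen, seqLen) attention mask it does accept. The sketch below assumes the convention of the code in this section (1 = attend, 0 = masked out) and repeats the masking and softmax steps inline so that it runs on its own.

```python
import math

import torch
import torch.nn.functional as F

# True marks padding, as in PyTorch's src_key_padding_mask; shape [batch, t]
src_key_padding_mask = torch.tensor([[False, False, False, True],
                                     [False, False, True, True],
                                     [False, False, False, False]])

# Flip to 1 = real token, then repeat the key axis for every query position
keep = (~src_key_padding_mask).long()                        # [batch, t]
attn_mask = keep.unsqueeze(1).expand(-1, keep.size(1), -1)   # [batch, t, t]

# Same masking and softmax as in scaled dot-product attention, single head
torch.manual_seed(0)
q = k = torch.randn(3, 4, 8)                                 # [batch, t, dim]
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
scores = scores.masked_fill(attn_mask == 0, -1e9)
weights = F.softmax(scores, dim=-1)

print(weights[1])  # the last two columns (padded keys) are zero
```

Rows that belong to padded query positions are still computed; they are normally discarded later, for example by ignoring those positions in the loss or in pooling.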
