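April 2021. Paper: Attention Is All You Need.
Advantages of the Transformer: it is built on self-attention, which has strengths that neither CNNs nor LSTMs have. It sees wider and farther than a CNN (every position can attend to every other position within a single layer), and it trains faster than an LSTM (all positions are processed in parallel rather than step by step). Stacking multiple layers of multi-head self-attention keeps being shown to have strong expressive power.
Its role is much like that of an LSTM: it is generally used as a feature encoder (to be confirmed).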
/Anaconda3/Lib/site-packages/torch/nn/modules/transformer.py
The structure is clear:
class Transformer(Module):...
class TransformerEncoder(Module):...
class TransformerDecoder(Module):...
class TransformerEncoderLayer(Module):...
class TransformerDecoderLayer(Module):...
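1.1 Transformer basic structure
The key parts of the code are pulled out below. The overall structure has two parts, an encoder and a decoder. The encoder is implemented by TransformerEncoder, which is built from TransformerEncoderLayer; the decoder is implemented by TransformerDecoder, which is built from TransformerDecoderLayer.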
class Transformer(Module):
    def __init__(self, ) -> None:  # parameter list omitted in this excerpt
        super(Transformer, self).__init__()
        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation)
        encoder_norm = LayerNorm(d_model)
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers,
                                          encoder_norm)
        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward,
                                                dropout, activation)
        decoder_norm = LayerNorm(d_model)
        self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers,
                                          decoder_norm)
        self._reset_parameters()
        self.d_model = d_model
        self.nhead = nhead

    def forward(self, ) -> Tensor:  # parameter list omitted in this excerpt
        memory = self.encoder(src, mask=src_mask,
                              src_key_padding_mask=src_key_padding_mask)
        output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
                              tgt_key_padding_mask=tgt_key_padding_mask,
                              memory_key_padding_mask=memory_key_padding_mask)
        return output
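To make the two-part structure concrete, here is a minimal sketch that builds an nn.Transformer and calls its encoder and decoder separately, mirroring the forward method above. The small sizes (d_model=32, two encoder layers, three decoder layers) are my own choices to keep it fast, not values from the source.

```python
import torch
import torch.nn as nn

# Small sizes chosen only to keep the example fast (assumption, not from the text).
model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=3, dim_feedforward=64)

# The two halves are ordinary submodules, each holding a list of layers.
print(type(model.encoder).__name__, len(model.encoder.layers))
print(type(model.decoder).__name__, len(model.decoder.layers))

src = torch.rand(10, 2, 32)          # (S, N, E)
tgt = torch.rand(7, 2, 32)           # (T, N, E)
memory = model.encoder(src)          # encoder output, same shape as src
out = model.decoder(tgt, memory)     # decoder output, same shape as tgt
print(memory.shape, out.shape)
```

Calling model(src, tgt) does the same two steps in one call.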
1.2 TransformerEncoder inputs, outputs, and parameters
A common strategy: use TransformerEncoder as a feature extractor.
How do you extract features with TransformerEncoder?
According to the docstring, TransformerEncoder is a stack of encoder layers, so it takes three arguments: encoder_layer (a TransformerEncoderLayer instance), num_layers (the number of encoder layers to stack), and norm (an optional layer normalization applied to the output of the last layer).
r"""TransformerEncoder is a stack of N encoder layers

Args:
    encoder_layer: an instance of the TransformerEncoderLayer() class (required).
    num_layers: the number of sub-encoder-layers in the encoder (required).
    norm: the layer normalization component (optional).

Examples::
    >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
    >>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
    >>> src = torch.rand(10, 32, 512)
    >>> out = transformer_encoder(src)
"""
Source code that defines the encoder inside Transformer:
encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation)
encoder_norm = LayerNorm(d_model)
self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
forward arguments
Args:
    src: the sequence to the encoder (required).
    mask: the mask for the src sequence (optional).
    src_key_padding_mask: the mask for the src keys per batch (optional).
There are two kinds of mask: src_mask (passed as mask) and src_key_padding_mask. Here we mainly explain src_key_padding_mask. Its size must be N x S, i.e. batch x sequence length.
With this mask the padded positions are ignored, so the attention mechanism no longer attends to them. src_key_padding_mask is a binary (boolean) tensor: True where a position should be ignored, False where the original value should be kept.
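As a small illustration of the N x S convention, the mask can be derived directly from a padded batch of token ids. This is only a sketch: PAD_ID = 0 and the sequence lengths are assumptions I made for the example.

```python
import torch

PAD_ID = 0  # assumed padding id

input_id = torch.tensor([[2, 3, 6, 0],
                         [7, 8, 0, 0],
                         [12, 9, 5, 67]])

# True where the token is padding, False for real tokens; shape (N, S)
src_key_padding_mask = input_id.eq(PAD_ID)

# Equivalent construction from the true sequence lengths
lengths = torch.tensor([3, 2, 4])
positions = torch.arange(input_id.size(1)).unsqueeze(0)   # (1, S)
mask_from_lengths = positions >= lengths.unsqueeze(1)     # (N, S)

print(src_key_padding_mask)
```

Building the mask from lengths is safer when id 0 could also be a real token.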
Implementing a feature extractor based on the Examples or the source code
You have to implement the embedding, the position embedding, and the mask yourself.
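import torch
import torch.nn as nn

d_model = 128           # desired feature dimension; the transformer's input and output dimensions are the same
num_encoder_layers = 6  # number of sub-layers in the encoder

# One input batch with 3 samples
input_data = [['有', '一', '个'],
              ['名', '字'],
              ['我', '不', '知', '道']]
# After token -> id and padding to the same length (max length in this batch = 4); shape = [batch, t]
input_id = torch.tensor(
    [[2, 3, 6, 0],
     [7, 8, 0, 0],
     [12, 9, 5, 67]], dtype=torch.long)
# src_key_padding_mask derived from the padding (True = padded position); shape = [batch, t]
src_key_padding_mask = torch.tensor(
    [[0, 0, 0, 1],
     [0, 0, 1, 1],
     [0, 0, 0, 0]], dtype=torch.bool)
# Embedding: shape = [batch, t, d_model]. em_model was not defined in the original notes;
# the vocabulary size of 100 is an assumption for illustration.
em_model = nn.Embedding(num_embeddings=100, embedding_dim=d_model, padding_idx=0)
input_embedding = em_model(input_id)  # the encoder input is usually the embedding output

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder_norm = nn.LayerNorm(d_model)
encoder_model = nn.TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)
# By default the encoder expects [t, batch, d_model], so swap the first two dimensions;
# the mask stays [batch, t]
output = encoder_model(input_embedding.transpose(0, 1),
                       src_key_padding_mask=src_key_padding_mask)  # [t, batch, d_model]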
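One way to convince yourself that src_key_padding_mask really removes the padded positions from the computation is to change the values in the padded slots and check that the outputs at the real positions do not move. The sketch below does that; the sizes and the overwritten values are arbitrary choices of mine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=32)
encoder = nn.TransformerEncoder(layer, num_layers=2)
encoder.eval()  # disable dropout so the two runs are comparable

src = torch.rand(4, 2, d_model)                         # (S, N, E)
pad_mask = torch.tensor([[False, False, False, True],
                         [False, False, True, True]])   # (N, S), True = padding

# Same batch, but with different values written into the padded slots.
src_changed = src.clone()
src_changed[3, 0] = 100.0
src_changed[2:, 1] = -100.0

with torch.no_grad():
    out_a = encoder(src, src_key_padding_mask=pad_mask)
    out_b = encoder(src_changed, src_key_padding_mask=pad_mask)

keep = ~pad_mask.transpose(0, 1)   # (S, N), True for real tokens
same = torch.allclose(out_a[keep], out_b[keep], atol=1e-5)
print(same)
```

The outputs at the padded positions themselves are not meaningful and should be dropped or masked before pooling.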
1.3 Transformer inputs and outputs
Examples::
    >>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
    >>> src = torch.rand((10, 32, 512))
    >>> tgt = torch.rand((20, 32, 512))
    >>> out = transformer_model(src, tgt)
todo: what is tgt?? The input to the decoder part??
Which problems can the Transformer solve? What form does the data take?
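As far as I can tell from the forward signature shown in 1.1, tgt is indeed the decoder input: the already-embedded target sequence, shaped (T, N, E), and the output has the same shape as tgt. A minimal sketch with small, arbitrary sizes, including the causal mask that is normally passed as tgt_mask:

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, dim_feedforward=64)

src = torch.rand(10, 3, 32)   # (S, N, E): embedded source sequence
tgt = torch.rand(6, 3, 32)    # (T, N, E): embedded decoder input (target shifted right)

# Causal mask: position t of tgt may only attend to positions <= t.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(0))   # (T, T)

out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)   # follows tgt: (T, N, E)
```

In a translation-style task, src would be the embedded source sentence and tgt the embedded target sentence with a start token prepended.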
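A look at the decoder part:
During training:
1. At decoder time step 1 (the first time it receives input), the input is a special token. It may be the start-of-sequence token of the target (e.g. <BOS>), the end-of-sequence token of the source (e.g. <EOS>), or some other task-dependent input; different implementations vary slightly. The goal is to predict the first translated word (token).
2. Then <BOS> and the predicted first word together are fed to the decoder again to obtain the second predicted word.
3. Subsequent steps proceed in the same way.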
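The step-by-step procedure above can be sketched as a greedy decoding loop, which is how it would typically run at inference time (in training, the whole shifted target is usually fed at once together with a causal mask). Everything named here (VOCAB, BOS_ID, EOS_ID, MAX_LEN, to_vocab) is a hypothetical placeholder, the weights are untrained so the predicted ids are meaningless, and positional encoding is left out for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, D_MODEL, BOS_ID, EOS_ID, MAX_LEN = 20, 32, 1, 2, 8   # illustrative assumptions

embed = nn.Embedding(VOCAB, D_MODEL)
model = nn.Transformer(d_model=D_MODEL, nhead=4, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=64)
to_vocab = nn.Linear(D_MODEL, VOCAB)   # projects decoder features to token scores
for m in (embed, model, to_vocab):
    m.eval()

src_ids = torch.tensor([[5], [6], [7], [EOS_ID]])   # (S, N=1)

with torch.no_grad():
    memory = model.encoder(embed(src_ids))          # encode the source once
    tgt_ids = torch.tensor([[BOS_ID]])              # step 1: only the start token
    for _ in range(MAX_LEN - 1):
        tgt_mask = model.generate_square_subsequent_mask(tgt_ids.size(0))
        dec = model.decoder(embed(tgt_ids), memory, tgt_mask=tgt_mask)   # (T, 1, E)
        next_id = to_vocab(dec[-1]).argmax(dim=-1, keepdim=True)         # (1, 1)
        # Append the prediction; the grown sequence is the next decoder input.
        tgt_ids = torch.cat([tgt_ids, next_id], dim=0)
        if next_id.item() == EOS_ID:
            break

print(tgt_ids.squeeze(1).tolist())
```

Only the last time step of the decoder output is used at each iteration, because earlier steps were already predicted.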
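Requirements for the positional encoding
- It must reflect the difference between the same word at different positions.
- It must reflect ordering to some extent, and the encoding difference within a given range should not depend on the length of the text, i.e. it should have some invariance.
- Its values must be bounded.
Formula, pros, and cons: the sinusoidal encoding distinguishes different positions and has a bounded value range, but it has no sense of direction.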
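For reference, the sinusoidal encoding from the paper is written out below, with $pos$ the position, $i$ the index of a sine/cosine pair, and $d_{model}$ the embedding dimension. The second line is a short derivation of my own (not taken from the references) of why it lacks directionality: the dot product of two encodings depends only on the offset $k$, and because cosine is even, $+k$ and $-k$ give the same value.

```latex
\begin{aligned}
PE_{(pos,\,2i)} &= \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\
PE_{pos} \cdot PE_{pos+k}
  &= \sum_{i} \Big[ \sin(\omega_i\, pos)\,\sin(\omega_i (pos+k)) + \cos(\omega_i\, pos)\,\cos(\omega_i (pos+k)) \Big]
   = \sum_{i} \cos(\omega_i k), \qquad \omega_i = 10000^{-2i/d_{model}}
\end{aligned}
```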
References: https://zhuanlan.zhihu.com/p/166244505
https://wmathor.com/index.php/archives/1438/
How is it used?? The weight values are fixed and are not updated during training.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math


def positional_encoding(max_seq_len, embed_dim):
    """
    Initialize a positional encoding table; not learnable (not updated during training)
    :param max_seq_len: maximum sequence length
    :param embed_dim: dimension of the position embedding
    :return: array of shape (max_seq_len, embed_dim)
    """
    # Dimensions 2i and 2i+1 share one frequency, hence i // 2 in the exponent
    pe = np.array([
        [pos / np.power(10000, 2 * (i // 2) / embed_dim) for i in range(embed_dim)]
        if pos != 0 else np.zeros(embed_dim) for pos in range(max_seq_len)])
    # todo: why is the first position excluded??
    # (Row 0 stays all zeros; one plausible reason is to reserve index 0 for padding.)
    pe[1:, 0::2] = np.sin(pe[1:, 0::2])  # dim 2i, even
    pe[1:, 1::2] = np.cos(pe[1:, 1::2])  # dim 2i+1, odd
    return pe


def get_sen_encoding(encoding, position_list):
    # Look up the encoding row for each position id
    output = []
    for i in position_list:
        output.append(encoding[i])
    return np.array(output)


if __name__ == '__main__':
    # Stored under a new name so the function itself is not overwritten
    pe_table = positional_encoding(max_seq_len=100, embed_dim=16)
    print(pe_table)  # shape (100, 16)
    # todo: not sure whether this is the right way to use it
    sen = ['我', '们', '一', '起', '去', '旅', '行', '吧']  # len=8
    sen_pos_id = [0, 1, 2, 3, 4, 5, 6, 7]
    output = get_sen_encoding(pe_table, sen_pos_id)
    print(output.shape)  # (8, 16)
    # plt.figure(figsize=(10, 10))
    # sns.heatmap(pe_table)
    # plt.title("Sinusoidal Function")
    # plt.xlabel("hidden dimension")
    # plt.ylabel("sequence length")
    # plt.show()
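On the "how is it used" question: one common pattern, sketched here under my own assumptions, is to store the table as a buffer inside an nn.Module and add it to the token embeddings. FixedPositionalEncoding is a hypothetical helper name, embed_dim is assumed to be even, and unlike the NumPy version above, position 0 is encoded normally rather than left as zeros.

```python
import math
import torch
import torch.nn as nn


class FixedPositionalEncoding(nn.Module):
    """Adds a non-learnable sinusoidal table to token embeddings (hypothetical helper)."""

    def __init__(self, max_seq_len: int, embed_dim: int):
        super().__init__()
        position = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)   # (L, 1)
        # One frequency per sin/cos pair: 10000^(-2i/embed_dim)
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2, dtype=torch.float32) * (-math.log(10000.0) / embed_dim))
        table = torch.zeros(max_seq_len, embed_dim)
        table[:, 0::2] = torch.sin(position * div_term)
        table[:, 1::2] = torch.cos(position * div_term)
        # A buffer follows .to(device) and is saved in state_dict,
        # but it is not a Parameter, so the optimizer never updates it.
        self.register_buffer("table", table)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim); add the first seq_len rows of the table
        return x + self.table[: x.size(1)]


pos_enc = FixedPositionalEncoding(max_seq_len=100, embed_dim=16)
x = torch.zeros(2, 8, 16)   # stands in for token embeddings
out = pos_enc(x)
print(out.shape, len(list(pos_enc.parameters())))
```

Because the table is a buffer and not a parameter, it satisfies the "weights are not updated" requirement without any extra freezing step.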
3.2 Scaled dot-product attention, multi-head attention, encoder
https://wmathor.com/index.php/archives/1438/
https://blog.csdn.net/qq_37236745/article/details/107352273
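https://blog.csdn.net/weixin_41811314/article/details/106804906 (on the difference between the two masks)
The code below is for studying the encoder part of the transformer. It does not implement src_key_padding_mask, so in a real project it is better to use the torch implementation.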
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


# by xmm
class ScaledDotProductAttention(nn.Module):
    """
    Compute 'Scaled Dot Product Attention'
    Attention(Q,K,V) = softmax(Q*Kt/sqrt(dk)) *V

    for test:
    q = torch.randn(4, 8, 10, 64)  # (batch, n_head, seqLen, dim)
    k = torch.randn(4, 8, 10, 64)
    v = torch.randn(4, 8, 10, 64)
    mask = torch.ones(4, 8, 10, 10)
    model = ScaledDotProductAttention()
    res = model(q, k, v, mask)
    print(res[0].shape)  # torch.Size([4, 8, 10, 64])
    """

    def forward(self, query, key, value, attn_mask=None, dropout=None):
        """
        When Q, K and V are projections of the same input, this is self-attention;
        when Q comes from a different input than K and V, it is usually called
        cross-attention (the original note calls it soft-attention).
        url:https://www.e-learn.cn/topic/3764324
        url:https://my.oschina.net/u/4228078/blog/4497939
        :param query: (batch, n_head, seqLen, dim), where n_head is the number of heads and n_head*dim = embedSize
        :param key: (batch, n_head, seqLen, dim)
        :param value: (batch, n_head, seqLen, dim)
        :param attn_mask: (batch, n_head, seqLen, seqLen); 1 = may be attended to, 0 = masked out
        :param dropout: optional dropout module applied to the attention weights
        """
        # (batch, n_head, seqLen, seqLen): the attention weights are L x L
        # because every pair of tokens has a weight
        scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
        if attn_mask is not None:
            # Positions where the mask is 0 get a very large negative score,
            # so their softmax weight is effectively zero
            scores = scores.masked_fill(attn_mask == 0, -1e9)
        p_attn = F.softmax(scores, dim=-1)
        if dropout is not None:
            p_attn = dropout(p_attn)
        return torch.matmul(p_attn, value), p_attn  # (batch, n_head, seqLen, dim)


# by xmm
class MultiHeadAttention(nn.Module):
    """
    for test:
    q = torch.randn(4, 10, 8 * 64)  # (batch, seqLen, d_model)
    k = torch.randn(4, 10, 8 * 64)
    v = torch.randn(4, 10, 8 * 64)
    mask = torch.ones(4, 10, 10)  # (batch, seqLen, seqLen)
    model = MultiHeadAttention(h=8, d_model=8 * 64)
    res = model(q, k, v, mask)
    print(res.shape)  # torch.Size([4, 10, 512])
    """

    def __init__(self, h, d_model, dropout=0.1):
        super(MultiHeadAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linear_layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
        self.output_linear = nn.Linear(d_model, d_model)
        self.attention = ScaledDotProductAttention()
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, attn_mask=None):
        """
        :param query: (batch, seqLen, d_model)
        :param key: (batch, seqLen, d_model)
        :param value: (batch, seqLen, d_model)
        :param attn_mask: (batch, seqLen, seqLen)
        :return: (batch, seqLen, d_model)
        """
        batch_size = query.size(0)
        # 1, Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [l(x).view(batch_size, -1, self.h, self.d_k).transpose(1, 2)
                             for l, x in zip(self.linear_layers, (query, key, value))]
        # 2, Apply attention on all the projected vectors in batch.
        # "if attn_mask:" would raise for a multi-element tensor, so compare with None
        if attn_mask is not None:
            attn_mask = attn_mask.unsqueeze(1).repeat(1, self.h, 1, 1)  # (batch, n_head, seqLen, seqLen)
        x, atten = self.attention(query, key, value, attn_mask=attn_mask, dropout=self.dropout)
        # 3, "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.h * self.d_k)
        return self.output_linear(x)


# by xmm
class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, dim_feedforward, dropout, activation):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, dim_feedforward)
        self.w_2 = nn.Linear(dim_feedforward, d_model)
        self.dropout = nn.Dropout(dropout)
        self.activation = activation

    def forward(self, x):
        return self.dropout(self.w_2(self.activation(self.w_1(x))))


# by xmm
class TransformerEncoderLayer(nn.Module):
    """
    Bidirectional Encoder = Transformer (self-attention)
    Transformer = MultiHead_Attention + Feed_Forward with sublayer connection
    """

    def __init__(self, d_model, n_head, dim_feedforward, dropout=0.1, activation="relu"):
        """
        :param d_model: feature dimension of the input and output
        :param n_head: number of attention heads
        :param dim_feedforward: hidden size of the feed-forward sub-layer
        :param dropout: dropout probability
        :param activation: "relu" (default) or "gelu"
        """
        super().__init__()
        self.self_attn = MultiHeadAttention(h=n_head, d_model=d_model, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        if activation == "relu":
            self.activation = F.relu
        elif activation == "gelu":
            self.activation = F.gelu
        else:
            # Previously any other value left self.activation undefined
            raise ValueError("activation should be relu/gelu, not {}".format(activation))
        self.PositionwiseFeedForward = PositionwiseFeedForward(d_model=d_model, dim_feedforward=dim_feedforward,
                                                               dropout=dropout, activation=self.activation)

    def forward(self, x, atten_mask=None):
        """
        :param x: (batch, seqLen, em_dim)
        :param atten_mask: attn_mask, (batch, seqLen, seqLen)
        :return: (batch, seqLen, em_dim)
        """
        # add & norm 1
        attn = self.dropout(self.self_attn(x, x, x, attn_mask=atten_mask))
        x = self.norm1((x + attn))
        # add & norm 2
        x = self.norm2(x + self.PositionwiseFeedForward(x))
        return x


class TransformerEncoder(nn.Module):
    """
    Example:
    x = torch.randn(4, 10, 128)  # (batch, seqLen, em_dim)
    model = TransformerEncoder(d_model=128, n_head=8, nlayers=3)
    res = model(x)
    print(res.shape)  # torch.Size([4, 10, 128])
    """

    def __init__(self, d_model, n_head, nlayers, dim_feedforward=1024, dropout=0.1, activation="relu"):
        super(TransformerEncoder, self).__init__()
        self.encoder = nn.ModuleList([TransformerEncoderLayer(d_model, n_head, dim_feedforward, dropout, activation)
                                      for _ in range(nlayers)])

    def forward(self, x, atten_mask=None):
        """
        :param x: input dim == out dim
        :param atten_mask: corresponds to src_mask in the torch source; src_key_padding_mask is not implemented
        :return: same shape as x
        """
        for layer in self.encoder:
            x = layer(x, atten_mask)  # call the module rather than .forward so hooks run
        return x


if __name__ == '__main__':
    x = torch.randn(4, 10, 128)  # (batch, seqLen, em_dim)
    model = TransformerEncoder(d_model=128, n_head=8, nlayers=3)
    res = model(x)
    print(res.shape)  # torch.Size([4, 10, 128])
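Since the code above leaves src_key_padding_mask unimplemented, here is a minimal sketch of how such a mask could be folded into scaled dot-product attention. masked_attention is a hypothetical standalone function, not part of the classes above; it follows the torch convention (True = padding), which is the opposite of the attn_mask convention used in the study code.

```python
import math
import torch
import torch.nn.functional as F


def masked_attention(q, k, v, key_padding_mask=None):
    """q, k, v: (batch, n_head, seqLen, dim); key_padding_mask: (batch, seqLen), True = padding."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    if key_padding_mask is not None:
        # (batch, seqLen) -> (batch, 1, 1, seqLen): broadcast over heads and query
        # positions, so every query ignores the padded keys.
        scores = scores.masked_fill(key_padding_mask[:, None, None, :], float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights


q = k = v = torch.randn(2, 4, 5, 8)
key_padding_mask = torch.tensor([[False, False, False, True, True],
                                 [False, False, False, False, False]])
out, weights = masked_attention(q, k, v, key_padding_mask)
print(out.shape, weights[0, 0, 0])
```

The difference from src_mask is the shape: a key padding mask is per sample and per key (batch, seqLen), while src_mask is per query-key pair (seqLen, seqLen) and is typically shared across the batch.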