Implementing a Transformer and Reading the PyTorch Source (Part 1): Data Input

Goal

Using part-of-speech (POS) tagging as the example task, implement a Transformer and read through the relevant PyTorch source code along the way.

Data Preparation

The data is the treebank corpus that ships with the nltk toolkit. In this dataset the inputs are complete sentences, and the targets are the part-of-speech tag of every word in those sentences (a concrete example of a tagged sentence is shown right after the tag list below).
The meaning of each POS tag abbreviation is roughly as follows (adapted from a reference blog post):

Tag set at a glance:
Nouns: NN, NNS, NNP, NNPS
Pronouns: PRP, PRP$
Adjectives: JJ, JJR, JJS
Numerals: CD
Verbs: VB, VBD, VBG, VBN, VBP, VBZ
Adverbs: RB, RBR, RBS
1. CC    Coordinating conjunction
2. CD    Cardinal number
3. DT    Determiner (e.g. this, that, these, those, such; indefinite determiners: no, some, any, each, every, enough, either, neither, all, both, half, several, many, much, (a) few, (a) little, other, another)
4. EX    Existential "there"
5. FW    Foreign word
6. IN    Preposition or subordinating conjunction
7. JJ    Adjective (including ordinal numbers)
8. JJR   Adjective, comparative
9. JJS   Adjective, superlative
10. LS   List item marker
11. MD   Modal auxiliary
12. NN   Noun, singular or mass
13. NNS  Noun, plural
14. NNP  Proper noun, singular
15. NNPS Proper noun, plural
16. PDT  Predeterminer
17. POS  Possessive ending
18. PRP  Personal pronoun
19. PRP$ Possessive pronoun
20. RB   Adverb
21. RBR  Adverb, comparative
22. RBS  Adverb, superlative
23. RP   Particle
24. SYM  Symbol
25. TO   "to", as preposition or infinitive marker
26. UH   Interjection
27. VB   Verb, base form
28. VBD  Verb, past tense
29. VBG  Verb, gerund or present participle
30. VBN  Verb, past participle
31. VBP  Verb, non-3rd person singular present
32. VBZ  Verb, 3rd person singular present
33. WDT  Wh-determiner (relative: whose, which; interrogative: what, which, whose)
34. WP   Wh-pronoun (who, whose, which)
35. WP$  Possessive wh-pronoun
36. WRB  Wh-adverb (how, where, when)
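
To see the raw data concretely, you can print the beginning of the first tagged sentence. This is a minimal sketch; it assumes nltk is installed and the treebank sample has been downloaded, and the printed output is only indicative:

import nltk
from nltk.corpus import treebank

nltk.download("treebank")  # the Penn Treebank sample that ships with NLTK
print(treebank.tagged_sents()[0][:5])
# e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS')]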

Processing


from nltk.corpus import treebank

# sents holds the sentences, postags holds the POS tag of every word in each sentence
sents, postags = zip(*(zip(*sent) for sent in treebank.tagged_sents()))

# Build a vocabulary over the words and another over the tags,
# assigning every distinct token a unique integer id
vocab = Vocab.build(sents, reserved_tokens=["<pad>"])
tag_vocab = Vocab.build(postags)

# The first 3,000 sentences form the training set and the rest the test set;
# every word and tag is replaced by its integer id
train_data = [(vocab.convert_tokens_to_ids(sentence), tag_vocab.convert_tokens_to_ids(tags)) for sentence, tags in zip(sents[:3000], postags[:3000])]
test_data = [(vocab.convert_tokens_to_ids(sentence), tag_vocab.convert_tokens_to_ids(tags)) for sentence, tags in zip(sents[3000:], postags[3000:])]
pos_vocab = tag_vocab
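
The nested zip expression may look cryptic; the following sketch shows what it does on hypothetical toy data. Each inner zip(*sent) splits one sentence's (word, tag) pairs into a tuple of words and a tuple of tags, and the outer zip gathers all word tuples into sents and all tag tuples into postags.

# Toy data mimicking treebank.tagged_sents(): a list of (word, tag) pairs per sentence
tagged_sents = [[("I", "PRP"), ("run", "VBP")],
                [("Dogs", "NNS"), ("bark", "VBP"), (".", ".")]]

sents, postags = zip(*(zip(*sent) for sent in tagged_sents))
print(sents)    # (('I', 'run'), ('Dogs', 'bark', '.'))
print(postags)  # (('PRP', 'VBP'), ('NNS', 'VBP', '.'))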

The Vocab class looks like this:

from collections import defaultdict


class Vocab:
    def __init__(self, tokens=None):
        self.idx_to_token = list()
        self.token_to_idx = dict()

        if tokens is not None:
            if "<unk>" not in tokens:
                tokens = tokens + ["<unk>"]
            for token in tokens:
                self.idx_to_token.append(token)
                self.token_to_idx[token] = len(self.idx_to_token) - 1
            self.unk = self.token_to_idx["<unk>"]

    @classmethod
    def build(cls, text, min_freq=1, reserved_tokens=None):
        # Count token frequencies over the whole corpus
        token_freqs = defaultdict(int)
        for sentence in text:
            for token in sentence:
                token_freqs[token] += 1
        # <unk> always comes first, followed by any reserved tokens (e.g. <pad>)
        uniq_tokens = ["<unk>"] + (reserved_tokens if reserved_tokens else [])
        uniq_tokens += [token for token, freq in token_freqs.items()
                        if freq >= min_freq and token != "<unk>"]
        return cls(uniq_tokens)

    def __len__(self):
        return len(self.idx_to_token)

    def __getitem__(self, token):
        # Unknown tokens map to the <unk> id
        return self.token_to_idx.get(token, self.unk)

    def convert_tokens_to_ids(self, tokens):
        return [self[token] for token in tokens]

    def convert_ids_to_tokens(self, indices):
        return [self.idx_to_token[index] for index in indices]

A note on the Vocab class: it counts how often every word in the input sentences occurs and assigns each word an integer id, which serves as the word's numeric representation. For the meaning of @classmethod, see the referenced blog post.
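A quick usage sketch with hypothetical toy sentences (not the treebank data):

sentences = [["the", "cat", "sat"], ["the", "dog", "barked"]]
toy_vocab = Vocab.build(sentences, reserved_tokens=["<pad>"])

print(len(toy_vocab))                                   # 7: <unk>, <pad> and five distinct words
print(toy_vocab.convert_tokens_to_ids(["the", "cat"]))  # [2, 3] with this insertion order
print(toy_vocab["zebra"])                               # unseen words fall back to the <unk> id, 0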
Next, build iterators over the processed data:

import torch
from torch.utils.data import DataLoader

# Wrap the training and test data in Dataset objects
train_dataset = TransformerDataset(train_data)
test_dataset = TransformerDataset(test_data)

# Load the datasets with DataLoaders. The key argument is collate_fn,
# which specifies how the samples drawn from the dataset are combined into a batch
train_data_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, shuffle=True)
test_data_loader = DataLoader(test_dataset, batch_size=1, collate_fn=collate_fn, shuffle=False)

num_class = len(pos_vocab)

# Choose the device the model will run on
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

The TransformerDataset class, together with the collate_fn used above, is defined as follows:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset


class TransformerDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]


def collate_fn(examples):
    """Combine the examples drawn for one batch into padded tensors; this is
    where per-batch preprocessing such as padding is usually done."""
    lengths = torch.tensor([len(ex[0]) for ex in examples])
    inputs = [torch.tensor(ex[0]) for ex in examples]
    targets = [torch.tensor(ex[1]) for ex in examples]
    # Pad the samples in the batch so that they all have the same length
    inputs = pad_sequence(inputs, batch_first=True, padding_value=vocab["<pad>"])
    targets = pad_sequence(targets, batch_first=True, padding_value=vocab["<pad>"])
    # Returned values: the inputs, the true length of each input, the targets,
    # and a mask marking the non-padding positions
    return inputs, lengths, targets, inputs != vocab["<pad>"]

For a more detailed explanation of collate_fn, see the referenced blog post.
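As a quick illustration of the padding step, here is what pad_sequence does to a ragged toy batch (made-up token ids, with the padding id assumed to be 0):

import torch
from torch.nn.utils.rnn import pad_sequence

batch = [torch.tensor([5, 2, 9]), torch.tensor([7, 1]), torch.tensor([3, 4, 8, 6])]
padded = pad_sequence(batch, batch_first=True, padding_value=0)
print(padded)
# tensor([[5, 2, 9, 0],
#         [7, 1, 0, 0],
#         [3, 4, 8, 6]])
print(padded != 0)  # True at real tokens, False at padding, like the mask returned by collate_fn
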
Finally, build the Transformer class:

import torch
from torch import nn
import torch.nn.functional as F


class Transformer(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_class,
                 dim_feedforward=512, num_head=2, num_layers=2, dropout=0.1, max_len=512, activation: str = "relu"):
        super(Transformer, self).__init__()
        # Word embedding layer (hidden_dim is expected to equal embedding_dim)
        self.embedding_dim = embedding_dim
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.position_embedding = PositionalEncoding(embedding_dim, dropout, max_len)
        # Encoder: a stack of Transformer encoder layers
        encoder_layer = nn.TransformerEncoderLayer(hidden_dim, num_head, dim_feedforward, dropout, activation)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers)
        # Output layer
        self.output = nn.Linear(hidden_dim, num_class)

    def forward(self, inputs, lengths):
        # (batch, seq_len) -> (seq_len, batch), the layout nn.TransformerEncoder expects by default
        inputs = torch.transpose(inputs, 0, 1)
        hidden_states = self.embeddings(inputs)
        # Debug output: print each row of token ids together with its embeddings
        for inp, hid in zip(inputs, hidden_states):
            print("===================================")
            print(inp)
            print(hid)
        hidden_states = self.position_embedding(hidden_states)
        # True at padding positions, which is what src_key_padding_mask expects
        attention_mask = length_to_mask(lengths) == False
        hidden_states = self.transformer(hidden_states, src_key_padding_mask=attention_mask).transpose(0, 1)
        logits = self.output(hidden_states)
        log_probs = F.log_softmax(logits, dim=-1)
        return log_probs
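
Note that PositionalEncoding and length_to_mask are used above but not defined in this post. For reference, a common way to implement length_to_mask that is consistent with how it is used here (an assumption, not necessarily the author's exact version) is:

import torch

def length_to_mask(lengths):
    # lengths: a (batch,) tensor of true sequence lengths
    max_len = int(lengths.max())
    # position < length -> True for real tokens, False for padding
    return torch.arange(max_len, device=lengths.device)[None, :] < lengths[:, None]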

Because the Transformer class inherits from nn.Module, the code in forward runs automatically whenever an instance of the class is called. Looking at the source code, we find that every forward call goes through the _call_impl method, which is invoked by __call__, so the instantiated class can be called directly like a function.
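A tiny toy module (not part of the model above) makes this behaviour visible:

import torch
from torch import nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        print("forward was called")
        return self.linear(x)

toy = Toy()
out = toy(torch.randn(3, 4))  # __call__ -> _call_impl -> forward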
Finally, the overall training loop of the model.
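The loop below assumes that model, optimizer and nll_loss have already been set up. A minimal sketch of that setup, with illustrative hyperparameter values that are not taken from the post (batch_size is likewise assumed to have been defined before the DataLoader code above):

import torch
from torch import optim
from torch.nn.functional import nll_loss
from tqdm import tqdm

embedding_dim, hidden_dim, num_epoch = 128, 128, 5  # illustrative values

model = Transformer(len(vocab), embedding_dim, hidden_dim, num_class)
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)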

model.train()
for epoch in range(num_epoch):
    total_loss = 0
    for batch in tqdm(train_data_loader, desc=f"Training Epoch {epoch}"):
        inputs, lengths, targets, mask = [x.to(device) for x in batch]
        log_probs = model(inputs, lengths)
        # Only the non-padding positions contribute to the loss
        loss = nll_loss(log_probs[mask], targets[mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        break  # stops after the first batch; remove to train on the whole dataset
    print(f"Loss: {total_loss:.2f}")
    break  # stops after the first epoch; remove to train for all epochs

Next, we look at how the components used in the Transformer class are actually implemented, starting with the word embedding step; see the follow-up post:
Implementing a Transformer and Reading the PyTorch Source (Part 2): The Embedding Source Code
