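This post is adapted from code in a GitHub repository. I was not very familiar with how to plug pretrained word2vec word vectors into a model's embedding layer, and working through the code of this simple model gave me a clear picture. I hope it helps others who need it too.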
import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
import gensim
1. Define the sentence text used for testing
# Context size: the previous 2 words are used to predict the next word
CONTEXT_SIZE = 2
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
2. Encode the tokenized words
# Map every word to an integer index; the embedding layer looks up word vectors by index.
vocab = set(test_sentence)  # set() removes duplicate words
word_to_idx = {word: i + 1 for i, word in enumerate(vocab)}
# Reserve index 0 for '<unk>': any word that does not appear in the training set counts as unknown, and its word vector is all zeros.
word_to_idx['<unk>'] = 0
idx_to_word = {i + 1: word for i, word in enumerate(vocab)}
idx_to_word[0] = '<unk>'
# Group the words into triples: the first two words of each triple are the input and the third is the word to predict.
trigram = [((test_sentence[i], test_sentence[i + 1]), test_sentence[i + 2])
           for i in range(len(test_sentence) - 2)]
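The encoding step is easier to follow on a tiny input. Below is a minimal sketch, assuming a made-up six-word sentence; the names `toy_sentence`, `toy_word_to_idx`, and `encode` are hypothetical and not part of the original code. It sorts the vocabulary so that the indices are reproducible (iterating a set of strings can give a different order on each run) and sends unseen words to index 0.

```python
# Hypothetical toy sentence, not from the original post.
toy_sentence = "the cat sat on the mat".split()

# sorted() makes the index assignment reproducible; iterating a bare set
# of strings can give a different order on every run.
toy_vocab = sorted(set(toy_sentence))
toy_word_to_idx = {word: i + 1 for i, word in enumerate(toy_vocab)}
toy_word_to_idx['<unk>'] = 0  # index 0 is reserved for unknown words
toy_idx_to_word = {i: word for word, i in toy_word_to_idx.items()}


def encode(words, word_to_idx):
    """Hypothetical helper: map words to indices, unseen words to 0."""
    return [word_to_idx.get(word, 0) for word in words]


# Each sample is ((word_i, word_i+1), word_i+2).
toy_trigram = [((toy_sentence[i], toy_sentence[i + 1]), toy_sentence[i + 2])
               for i in range(len(toy_sentence) - 2)]

print(toy_word_to_idx)
print(toy_trigram[0])
print(encode(['the', 'dog'], toy_word_to_idx))
```

Building the reverse mapping by inverting `toy_word_to_idx` keeps the two dictionaries consistent by construction, which is one possible alternative to enumerating the vocabulary twice.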
3. Load and use the pretrained word vectors
wvmodel = gensim.models.KeyedVectors.load_word2vec_format('/Users/wyw/Documents/vectors/word2vec/word2vec.6B.100d.txt', binary=False, encoding='utf-8')
vocab_size = len(word_to_idx)
embed_size = 100
# Row i of weight holds the vector of the word with index i.
# Words that are missing from the pretrained model keep an all-zero row.
weight = torch.zeros(vocab_size, embed_size)
# Note: gensim >= 4.0 renames index2word to index_to_key.
for word in wvmodel.index2word:
    if word not in word_to_idx:
        continue
    weight[word_to_idx[word], :] = torch.from_numpy(wvmodel.get_vector(word))
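To see what this loop produces without downloading a vector file, here is a rough sketch that swaps the gensim model for a hypothetical dictionary `fake_vectors` holding 4-dimensional vectors. The words, vectors, and names are all made up for illustration; only the copy-into-matrix logic mirrors the step above.

```python
import torch

# Hypothetical stand-in for the pretrained model: word -> 4-d vector.
fake_vectors = {
    'beauty': [0.1, 0.2, 0.3, 0.4],
    'winters': [0.5, 0.6, 0.7, 0.8],
    'london': [0.9, 1.0, 1.1, 1.2],  # not in our vocabulary, so it is skipped
}
toy_word_to_idx = {'<unk>': 0, 'beauty': 1, 'brow': 2, 'winters': 3}

toy_weight = torch.zeros(len(toy_word_to_idx), 4)
for word, vector in fake_vectors.items():
    if word in toy_word_to_idx:
        toy_weight[toy_word_to_idx[word]] = torch.tensor(vector)

# Words in our vocabulary without a pretrained vector keep a zero row.
missing = [w for w, i in toy_word_to_idx.items()
           if toy_weight[i].abs().sum().item() == 0]
print(toy_weight)
print('no pretrained vector for:', missing)
```

Counting zero rows like this is a quick way to check how many words actually received a pretrained vector. Because the sonnet is tokenized with split(), tokens such as "brow," keep their punctuation and only match if the pretrained vocabulary contains that exact string.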
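4. Define the model
class NgramModel(nn.Module):
    def __init__(self, vocb_size, context_size, n_dim):
        super(NgramModel, self).__init__()
        self.n_word = vocb_size
        # Option A: do not use the pretrained word2vec vectors in the embedding layer
        # self.embedding = nn.Embedding(self.n_word, n_dim)
        # Option B: initialize the embedding layer from the pretrained vectors
        self.embedding = nn.Embedding.from_pretrained(weight)
        # requires_grad controls whether the word vectors are fine-tuned during training
        # (from_pretrained freezes them by default, so it is switched back on here)
        self.embedding.weight.requires_grad = True
        self.linear1 = nn.Linear(context_size * n_dim, 128)
        self.linear2 = nn.Linear(128, self.n_word)

    def forward(self, x):
        emb = self.embedding(x)  # (context_size, n_dim)
        emb = emb.view(1, -1)  # concatenate the context vectors: (1, context_size * n_dim)
        out = self.linear1(emb)
        out = F.relu(out)
        out = self.linear2(out)
        # dim=1 normalizes over the vocabulary; calling log_softmax without dim is deprecated
        log_prob = F.log_softmax(out, dim=1)
        return log_prob

ngrammodel = NgramModel(len(word_to_idx), CONTEXT_SIZE, 100)
criterion = nn.NLLLoss()
optimizer = optim.SGD(ngrammodel.parameters(), lr=1e-3)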
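The shapes inside forward can be traced by hand on a small example. The sketch below assumes hypothetical toy sizes (6 words, 4-dimensional vectors, 2 context words) and a random `toy_weight` in place of real pretrained vectors. It also shows the `freeze` argument of `nn.Embedding.from_pretrained`, which is another way to get the effect of setting requires_grad afterwards.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy sizes: 6 words, 4-d vectors, 2 context words.
toy_weight = torch.randn(6, 4)

# freeze defaults to True, so the table is not trained unless we say so.
frozen = nn.Embedding.from_pretrained(toy_weight)
tunable = nn.Embedding.from_pretrained(toy_weight, freeze=False)
print(frozen.weight.requires_grad, tunable.weight.requires_grad)

context = torch.LongTensor([2, 5])      # indices of the two context words
emb = tunable(context)                  # shape (2, 4)
flat = emb.view(1, -1)                  # shape (1, 8)
logits = nn.Linear(2 * 4, 6)(flat)      # shape (1, 6): one score per word
log_prob = F.log_softmax(logits, dim=1)

print(emb.shape, flat.shape, log_prob.shape)
print(log_prob.exp().sum().item())      # probabilities sum to roughly 1
```

Since the output is a row of log-probabilities, it pairs with nn.NLLLoss as in the code above; feeding raw scores to nn.CrossEntropyLoss would be an equivalent formulation.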
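Start training
Training runs for 300 epochs. In each epoch, word holds the two words that precede the word to predict, and label is the word to predict. The pair is fed through the network, the loss function computes the loss, and backpropagation updates the parameters.
In this example, training with the pretrained word vectors converges much more slowly than training without them. The sample is so small that the model without pretrained vectors overfits much sooner, so its loss drops faster. You can try this yourself, or set self.embedding.weight.requires_grad = False to see whether fine-tuning the word vectors during training makes a difference. With a sample this small, though, you probably won't see much of an effect =__=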
for epoch in range(300):
    print('epoch: {}'.format(epoch + 1))
    print('*' * 10)
    running_loss = 0
    for data in trigram:
        word, label = data
        word = torch.LongTensor([word_to_idx[i] for i in word])
        label = torch.LongTensor([word_to_idx[label]])
        # forward
        out = ngrammodel(word)
        loss = criterion(out, label)
        # loss.item() replaces loss.data[0], which raises an error on recent PyTorch versions
        running_loss += loss.item()
        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # average the loss over the number of training samples
    print('Loss: {:.6f}'.format(running_loss / len(trigram)))
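The suggestion to toggle requires_grad can be checked directly rather than judged from the loss curve. Here is a small sketch, assuming a random `toy_weight` as a stand-in for pretrained vectors and a hypothetical helper `embedding_changes_after_one_step`; it runs a single SGD step and reports whether the embedding table moved.

```python
import torch
from torch import nn, optim
import torch.nn.functional as F

torch.manual_seed(0)
toy_weight = torch.randn(6, 4)  # hypothetical pretrained table


def embedding_changes_after_one_step(freeze):
    """Hypothetical helper: run one SGD step, report if the table moved."""
    # clone() keeps toy_weight untouched so it can serve as the reference.
    embedding = nn.Embedding.from_pretrained(toy_weight.clone(), freeze=freeze)
    linear = nn.Linear(2 * 4, 6)
    params = [p for p in list(embedding.parameters()) + list(linear.parameters())
              if p.requires_grad]
    optimizer = optim.SGD(params, lr=0.1)

    context = torch.LongTensor([2, 5])
    label = torch.LongTensor([3])
    log_prob = F.log_softmax(linear(embedding(context).view(1, -1)), dim=1)
    loss = F.nll_loss(log_prob, label)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return not torch.equal(embedding.weight.detach(), toy_weight)


print('frozen table changed:', embedding_changes_after_one_step(freeze=True))
print('fine-tuned table changed:', embedding_changes_after_one_step(freeze=False))
```

With fine-tuning enabled, only the rows of the words that appear in the context receive a gradient, so on a corpus as small as one sonnet most of the table stays close to its pretrained values either way.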
5. Check the model's predictions
word, label = trigram[3]
word = torch.LongTensor([word_to_idx[i] for i in word])
out = ngrammodel(word)
_, predict_label = torch.max(out, 1)
predict_word = idx_to_word[predict_label.item()]
print('real word is {}, predict word is {}'.format(label, predict_word))
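Looking at a few candidates instead of only the single best word can make the check more informative. The sketch below assumes a hand-written row of log-probabilities and a four-word vocabulary, both made up for illustration, and a hypothetical helper `top_k_words` built on `torch.topk`.

```python
import torch

# Hypothetical vocabulary and model output (log-probabilities, shape (1, 4)).
toy_idx_to_word = {0: '<unk>', 1: 'beauty', 2: 'brow', 3: 'winters'}
toy_log_prob = torch.log(torch.tensor([[0.1, 0.6, 0.1, 0.2]]))


def top_k_words(log_prob, idx_to_word, k=2):
    """Hypothetical helper: return the k most likely (word, probability) pairs."""
    values, indices = torch.topk(log_prob, k, dim=1)
    return [(idx_to_word[i.item()], v.exp().item())
            for v, i in zip(values[0], indices[0])]


for word, prob in top_k_words(toy_log_prob, toy_idx_to_word):
    print('{}: {:.2f}'.format(word, prob))
```

In the post's setting, the output of ngrammodel(word) would take the place of `toy_log_prob`, and idx_to_word the place of `toy_idx_to_word`.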
Source: https://github.com/atnlp/torchtext-summary/blob/master/Language-Model.ipynb
Reference: https://blog.csdn.net/nlpuser/article/details/83627709