PyTorch NLP Tutorial for Beginners (2): CBOW

In the previous tutorial we introduced the NNLM. Although the NNLM takes the words preceding a target word into account, it cannot make use of the words that follow it, and it is computationally expensive. CBOW, one of the models in Word2vec, addresses both problems.


Goal: predict the center word $w(t)$ from the words around it.

Objective function: $J = \sum_{w \in corpus} P(w \mid context(w))$
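In the standard word2vec formulation, $P(w \mid context(w))$ is computed with a softmax over the vocabulary, and the quantity actually optimized is the log-likelihood $\sum_{w \in corpus} \log P(w \mid context(w))$; minimizing the negative log-likelihood is exactly what `NLLLoss` does in the code below. Writing $h$ for the averaged projection vector of the context and $u_w$ for the output vector of word $w$:

$$P(w \mid context(w)) = \frac{\exp(u_w^{\top} h)}{\sum_{v=1}^{V} \exp(u_v^{\top} h)}$$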

Input: the one-hot vectors of the context words. If the vocabulary size (the one-hot dimension) is V and the number of context words is C, the input matrix has shape $C \times V$.

PROJECTION: each context word's one-hot vector is multiplied by the input weight matrix W ($V \times N$); the results are summed and averaged, giving a hidden-layer vector of shape $1 \times N$.

Output: the hidden vector is multiplied by the output weight matrix W' ($N \times V$), giving an output vector of shape $1 \times V$, i.e. the probability vector over the vocabulary.
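A minimal shape check of this pipeline (the toy sizes V=10, N=5, C=4 and the names W_in, W_out are made up for illustration):

import torch

V, N, C = 10, 5, 4                    # vocab size, embedding dim, number of context words
context = torch.eye(V)[torch.tensor([1, 3, 5, 7])]   # C x V: one one-hot row per context word
W_in = torch.randn(V, N)              # input weight matrix W
W_out = torch.randn(N, V)             # output weight matrix W'

h = (context @ W_in).mean(dim=0, keepdim=True)        # 1 x N: averaged projection vector
probs = torch.softmax(h @ W_out, dim=1)               # 1 x V: probability vector over the vocabulary
print(h.shape, probs.shape)           # torch.Size([1, 5]) torch.Size([1, 10])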
A full working example:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

corpus = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

# Hyperparameters
window_size = 2
embedding_dim = 100
hidden_dim = 128

# Data preprocessing
sentences = corpus.split()  # whitespace tokenization
words = list(set(sentences))  # vocabulary
word_dict = {word: i for i, word in enumerate(words)}  # index of each word
data = []  # (context words, center word) training pairs
for i in range(window_size, len(sentences)-window_size):
    content = [sentences[i-1], sentences[i-2],
               sentences[i+1], sentences[i+2]]
    target = sentences[i]
    data.append((content, target))
print(data[:5])
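# With this corpus, the first pair is (['are', 'We', 'to', 'study'], 'about'):
# the two words before and the two words after the center word 'about'.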

# Convert the context words into a tensor of word indices
def make_content_vector(content, word_to_ix):
    idx = [word_to_ix[w] for w in content]
    return torch.LongTensor(idx)

# CBOW model
class CBOW(nn.Module):
    def __init__(self, vocab_size, n_dim, window_size, hidden_dim):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, n_dim)
        self.linear1 = nn.Linear(2*n_dim*window_size, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, X):
        embeds = self.embedding(X).view(1, -1)
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs
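# Shape walk-through for one sample (window_size=2, embedding_dim=100, hidden_dim=128):
#   X: (4,) context word indices -> embedding(X): (4, 100) -> view(1, -1): (1, 400)
#   linear1: (1, 400) -> (1, 128); linear2: (1, 128) -> (1, vocab_size)
#   log_softmax gives (1, vocab_size) log-probabilities, which NLLLoss consumes below.
# Note: this implementation concatenates the context embeddings instead of averaging
# them as in the description above (a simplification often seen in tutorials).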

# Training
model = CBOW(len(word_dict), embedding_dim, window_size, hidden_dim)
if torch.cuda.is_available():
    model = model.cuda()
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
for epoch in range(500):
    total_loss = 0
    for content, target in data:
        content_vector = make_content_vector(content, word_dict)
        target = torch.tensor([word_dict[target]], dtype=torch.long)
        if torch.cuda.is_available():
            content_vector = content_vector.cuda()
            target = target.cuda()
        
        optimizer.zero_grad()
        
        log_probs = model(content_vector)
        loss = criterion(log_probs, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%03d' % (epoch + 1), 'cost =', '{:.6f}'.format(total_loss))
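
After training, the rows of `model.embedding.weight` are the learned word vectors. A short, illustrative way to inspect one vector and to check a prediction (the word 'processes' is just one token picked from the corpus above):

# Learned embedding of a single word: a 100-dimensional vector
vec = model.embedding.weight.data[word_dict['processes']]
print(vec.shape)  # torch.Size([100])

# Predict the center word for the first training context
content, target = data[0]
content_vector = make_content_vector(content, word_dict)
if torch.cuda.is_available():
    content_vector = content_vector.cuda()
with torch.no_grad():
    log_probs = model(content_vector)
print(content, '->', words[log_probs.argmax(dim=1).item()], '| true:', target)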
