In the previous tutorial we covered NNLM. Although NNLM takes the words before a target word into account, it cannot use the words that come after it, and it is computationally expensive. To address both issues we can use CBOW, one of the models in Word2vec.
Goal: predict the center word $w(t)$ from the words surrounding it.
Objective function: $J = \sum_{w \in corpus} P(w \mid context(w))$
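In practice this objective is optimized through its logarithm: maximizing $\sum_{w} \log P(w \mid context(w))$ is the same as minimizing the negative log-likelihood, which is exactly what the nn.NLLLoss used in the training code below computes. A minimal illustration with made-up numbers (not part of the original example):

import torch
import torch.nn.functional as F

scores = torch.tensor([[1.0, 0.5, 2.0, -1.0, 0.2]])  # raw scores over a toy 5-word vocabulary
target = torch.tensor([2])                            # index of the true center word

log_probs = F.log_softmax(scores, dim=1)  # log P(w | context)
loss = F.nll_loss(log_probs, target)      # -log P(target | context)
print(loss.item(), -log_probs[0, 2].item())  # the two numbers are identical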
Input: the one-hot vectors of the context words. If the vocabulary size is $V$ and the number of context words is $C$, the input matrix has shape $C \times V$.
PROJECTION: each context word's one-hot vector is multiplied by the input weight matrix $W$ ($V \times N$); the results are summed and averaged, giving a hidden-layer vector of shape $1 \times N$.
Output: the hidden vector is multiplied by the output weight matrix $W'$ ($N \times V$), giving an output vector of shape $1 \times V$, which (after softmax) is the probability vector over the vocabulary.
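To make these three steps concrete, here is a minimal sketch of the one-hot → projection → output computation, using made-up sizes (C = 2, V = 5, N = 3) and random weights, so only the shapes matter:

import torch

V, N, C = 5, 3, 2                 # vocabulary size, embedding dim, number of context words
W = torch.randn(V, N)             # input weight matrix,  V x N
W_out = torch.randn(N, V)         # output weight matrix, N x V

context = torch.zeros(C, V)       # one-hot rows for the context words (C x V)
context[0, 1] = 1.0               # first context word has index 1
context[1, 3] = 1.0               # second context word has index 3

h = (context @ W).mean(dim=0, keepdim=True)  # averaged projection, shape 1 x N
scores = h @ W_out                           # output vector, shape 1 x V
probs = torch.softmax(scores, dim=1)         # probability vector over the vocabulary
print(probs.shape)  # torch.Size([1, 5])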
Here is a concrete example:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
corpus = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""
# Model hyperparameters
window_size = 2
embedding_dim = 100
hidden_dim = 128
# Data preprocessing
sentences = corpus.split()  # whitespace tokenization
words = list(set(sentences))
word_dict = {word: i for i, word in enumerate(words)}  # index of each word
data = []  # (context, target) training pairs
for i in range(window_size, len(sentences) - window_size):
    # window_size words on each side of the center word
    content = sentences[i - window_size:i] + sentences[i + 1:i + window_size + 1]
    target = sentences[i]
    data.append((content, target))
print(data[:5])
# Turn a list of context words into a tensor of word indices
def make_content_vector(content, word_to_ix):
    idx = [word_to_ix[w] for w in content]
    return torch.LongTensor(idx)
# CBOW model: embed the 2*window_size context words, concatenate the embeddings,
# and pass them through two linear layers to score every word in the vocabulary
class CBOW(nn.Module):
    def __init__(self, vocab_size, n_dim, window_size, hidden_dim):
        super(CBOW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, n_dim)
        self.linear1 = nn.Linear(2 * n_dim * window_size, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, X):
        embeds = self.embedding(X).view(1, -1)  # concatenated context embeddings, shape 1 x (2*window_size*n_dim)
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)   # log P(center word | context)
        return log_probs
# Train the model
model = CBOW(len(word_dict), embedding_dim, window_size, hidden_dim)
if torch.cuda.is_available():
    model = model.cuda()
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)
for epoch in range(500):
    total_loss = 0
    for content, target in data:
        content_vector = make_content_vector(content, word_dict)
        target = torch.tensor([word_dict[target]], dtype=torch.long)
        if torch.cuda.is_available():
            content_vector = content_vector.cuda()
            target = target.cuda()
        optimizer.zero_grad()
        log_probs = model(content_vector)
        loss = criterion(log_probs, target)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch + 1) % 100 == 0:
        print('Epoch:', '%03d' % (epoch + 1), 'cost =', '{:.6f}'.format(total_loss))
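After training, the learned word vectors are the rows of model.embedding.weight, and the model can be used to predict a center word from a context. A small usage sketch built on the variables defined above ('process' is simply one word that appears in the corpus; any vocabulary word works):

# Look up the trained embedding of one word from the vocabulary
vec = model.embedding.weight[word_dict['process']].detach()
print(vec.shape)  # torch.Size([100]), i.e. embedding_dim

# Predict the center word for one of the training contexts
content, target = data[0]
content_vector = make_content_vector(content, word_dict)
if torch.cuda.is_available():
    content_vector = content_vector.cuda()
with torch.no_grad():
    log_probs = model(content_vector)
predicted = words[log_probs.argmax(dim=1).item()]
print('context:', content, '-> predicted:', predicted, '| actual:', target)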