First of all, word2vec and word embedding are not concepts on the same level; both belong to the broader category of language representation (Representation).
Language representation (Representation) means encoding natural language as numerical values that a computer can process. Common approaches include one-hot representation and distributed representation.
Distributed representation can in turn be based on matrices, on clustering, or on neural networks; the neural-network-based distributed representations are what we usually call word embeddings.
Word embedding itself covers a family of algorithms, or rather of implementations and toolkits, such as SENNA, FastText and word2vec.
The relationship is therefore:
Representation → Distributed Representation → word embedding → word2vec
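To make the difference between the two representations concrete, here is a minimal sketch (my own illustrative example with a made-up toy vocabulary, not taken from the references below) contrasting a one-hot vector with a dense embedding vector for the same word:

import numpy as np

vocab = ["we", "are", "working", "on", "NLP"]
word2idx = {w: i for i, w in enumerate(vocab)}

# One-hot representation: a sparse vector as long as the vocabulary,
# with a single 1 at the word's index
one_hot = np.eye(len(vocab))[word2idx["working"]]        # [0. 0. 1. 0. 0.]

# Distributed representation (embedding): a short dense vector whose values
# are normally learned; here it is randomly initialized just for illustration
embedding_dim = 3
embedding_table = np.random.randn(len(vocab), embedding_dim)
dense_vector = embedding_table[word2idx["working"]]

print(one_hot, dense_vector)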
For details, see the following articles:
word embedding
word embedding and word2vec
as well as the chapter on representation learning in Qiu Xipeng's Neural Networks and Deep Learning.
In fact, an intuitive idea that is easy to come up with is that, within a sentence, words that appear close to each other generally have higher relatedness.
Take the following sentence as an example:
We are working on NLP project, it is interesting
For this sentence, the relatedness between we and are is clearly higher than that between we and NLP or the other words. This confirms the intuition above and also gives us a way forward:
1. Skip-Gram Model
The Skip-gram algorithm predicts the context words (the words inside the window other than the center word) given a target word (the center word). Here the window size is 2, i.e. two words on each side, so the context words of NLP are: working, on, project, it, as the small sketch below illustrates.
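A minimal sketch of taking the context window around a center word (my own illustrative code, using the example sentence above with punctuation dropped):

sentence = "We are working on NLP project it is interesting".split()
C = 2  # window size: two words on each side

center_idx = sentence.index("NLP")
left = sentence[max(0, center_idx - C):center_idx]
right = sentence[center_idx + 1:center_idx + 1 + C]
print(left + right)  # ['working', 'on', 'project', 'it']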
Consider the following:
We ___ working ___ NLP project, it is interesting. After training on a corpus, we want to find out what the words on either side of working are likely to be. (There is in fact another model, CBOW, which is not covered here; interested readers can refer to the paper linked at the end.)
How do we solve this?
Since we want the words that are likely to appear on either side of working, we are really looking for the words most related to working. A word counts as highly related (i.e. a neighboring word) if, in the given corpus context (We are working on NLP project, it is interesting), the probability of it appearing around working is as high as possible, i.e. we maximize P(context word | working). This conditional probability is written as a softmax, and maximizing it gives us the parameters (u, v), from which we obtain the word2vec vector of every word. The concrete form is shown in the figure below: the conditional probability is converted into a softmax, and we then solve for the word2vec vectors.
[Figure: word2vec — the skip-gram conditional probability written as a softmax over the parameters (u, v)]
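For reference, the standard way to write this objective (a sketch consistent with the (u, v) parameters above, not a verbatim copy of the original figure) is

$$\max_{u,v}\ \sum_{t=1}^{T} \sum_{-C \le j \le C,\ j \ne 0} \log P(w_{t+j} \mid w_t), \qquad P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}$$

where $v_c$ is the vector of the center word, $u_o$ the vector of a candidate context word, $C$ the window size and $V$ the vocabulary. The PyTorch code below implements this full-softmax version, with the matrix W holding the center-word vectors and V the context-word (output) vectors: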
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.utils.data as Data
dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sentences = ["is a domesticated carnivoran of the family Canidae",
"It is part of the wolf-like canids and is the most widely abundant terrestrial carnivore",
"The dog and the extant gray wolf are sister taxa as modern wolves are not closely related to the wolves that were first domesticated which implies that the direct ancestor of the dog is extinct",
"The dog and the extant gray wolf are sister taxa as modern wolves are not closely related to the wolves that were first domesticated which implies that the direct ancestor of the dog is extinct",
"Their long association with humans has led dogs to be uniquely attuned to human behavior[18] and they are able to thrive on a starch-rich diet that would be inadequate for other canids",
"Dogs vary widely in shape, size and colors",
"They perform many roles for humans such as hunting, herding, pulling loads protection assisting police and military companionship and more recently aiding disabled people and therapeutic roles"]
word_sequence = " ".join(sentences).split()
vocab = list(set(word_sequence))  # build the vocabulary of unique words
word2idx = {w: i for i, w in enumerate(vocab)}
# Word2Vec hyper-parameters
batch_size = 8
embedding_size = 2  # each word is represented by a 2-dimensional vector (easy to plot)
C = 2  # window size: two words on each side of the center word
voc_size = len(vocab)
# 1. Build (center, context) training pairs for skip-gram
skip_grams = []
for idx in range(C, len(word_sequence) - C):
    center = word2idx[word_sequence[idx]]  # index of the center word
    context_idx = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))  # positions of the context words
    context = [word2idx[word_sequence[i]] for i in context_idx]
    for w in context:
        skip_grams.append([center, w])
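# For example, with C = 2 the first center word is word_sequence[2] = "domesticated",
# so the first four pairs (in index form) correspond to [domesticated, is],
# [domesticated, a], [domesticated, carnivoran] and [domesticated, of].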
# 2. Convert the pairs into model inputs and targets: the input is the one-hot
#    vector of the center word, the target is the index of the context word
#    (an integer class label, as expected by CrossEntropyLoss)
def make_data(skip_grams):
    input_data = []
    output_data = []
    for i in range(len(skip_grams)):
        input_data.append(np.eye(voc_size)[skip_grams[i][0]])  # one-hot of the center word
        output_data.append(skip_grams[i][1])  # index of the context word
    return input_data, output_data
# 3. Wrap the data in a DataLoader
input_data, output_data = make_data(skip_grams)
input_data, output_data = torch.Tensor(np.array(input_data)), torch.LongTensor(output_data)
dataset = Data.TensorDataset(input_data, output_data)
loader = Data.DataLoader(dataset, batch_size, True)
# Model
class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()
        # W and V are two independent matrices, not each other's transpose:
        # W maps one-hot inputs to embeddings, V maps embeddings to vocabulary logits
        self.W = nn.Parameter(torch.randn(voc_size, embedding_size).type(dtype))
        self.V = nn.Parameter(torch.randn(embedding_size, voc_size).type(dtype))

    def forward(self, X):
        # X : [batch_size, voc_size], one-hot vectors
        # torch.mm only works on 2-D matrices, while torch.matmul supports any number of dimensions
        hidden_layer = torch.matmul(X, self.W)  # [batch_size, embedding_size]
        output_layer = torch.matmul(hidden_layer, self.V)  # [batch_size, voc_size]
        return output_layer
model = Word2Vec().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Training
for epoch in range(2000):
    for i, (batch_x, batch_y) in enumerate(loader):
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print(epoch + 1, i, loss.item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
# Plot the learned 2-D embeddings (the rows of W)
for i, label in enumerate(vocab):
    W, V = model.parameters()
    x, y = float(W[i][0]), float(W[i][1])
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()
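After training, each row of W is the learned vector of one word. As a usage example, here is a minimal sketch of looking up two word vectors and comparing them (my own addition; word_vector is a hypothetical helper, not part of the code above):

import torch.nn.functional as F

def word_vector(word):
    # the rows of W are the input (center-word) embeddings
    return model.W[word2idx[word]].detach()

v_dog, v_wolf = word_vector("dog"), word_vector("wolf")
# cosine similarity; with such a tiny corpus and 2-D embeddings the value is noisy
print(F.cosine_similarity(v_dog.unsqueeze(0), v_wolf.unsqueeze(0)).item())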
Paper:
It explains the two basic models, CBOW and Skip-Gram, including the details of their derivations, and then covers the techniques used to speed up the model: hierarchical softmax and negative sampling (a brief sketch of the negative-sampling objective follows after the links below). It is an excellent resource for understanding word2vec.
word2vec Parameter Learning Explained
Chinese translation:
https://zhuanlan.zhihu.com/p/52787964
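As a reference for the negative-sampling technique mentioned above, the idea is to replace the full softmax with a binary classification between one true context word o and k sampled noise words (this is a sketch of the standard formulation, not a quotation from the paper); the per-pair objective to maximize is

$$\log \sigma(u_o^{\top} v_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-u_{w_i}^{\top} v_c) \right]$$

where $\sigma$ is the sigmoid function and $P_n(w)$ is the noise distribution over the vocabulary, so only k + 1 output vectors are updated per training pair instead of all |V| of them.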