word2vec and word embedding

1. Introduction

First of all, word2vec and word embedding are not concepts at the same level; both belong to the broader field of language representation (Representation).
Language representation means expressing human natural language in a numerical form that computers can process. Common approaches include one-hot representation and distributed representation (Distributed Representation).
Distributed representation in turn includes matrix-based, cluster-based, and neural-network-based methods; the neural-network-based distributed representations are usually called word embeddings.
Word embedding itself covers a collection of different algorithms, or implementation tools, such as SENNA, FastText, and word2vec.
The relationship is as follows:
Representation → Distributed Representation → word embedding → word2vec

For more details, see the articles word embedding and word embedding and word2vec,
as well as the Representation Learning chapter of Qiu Xipeng's Neural Networks and Deep Learning.

2. word2vec

Why use word2vec (compared with one-hot encoding)?

[Figure 1: example word vectors; the distance between "machine" and "learning" is small, reflecting their high relatedness]

  • Denser (as opposed to the sparse one-hot representation).
  • More meaningful: with one-hot vectors, which form a sparse matrix, we cannot tell how two vectors are related, whereas word2vec vectors do capture such relationships. As the figure above shows, by measuring distances we can see that "machine" and "learning" are highly related (see the sketch after this list).
  • Larger capacity: roughly speaking, one-hot needs a separate dimension for every word, which makes the one-hot matrix huge; a distributed representation like word2vec can represent an essentially unlimited number of words with a fixed number of dimensions, because the vector components are real numbers.
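
A minimal sketch (hypothetical numbers, not from the original post) contrasting the two representations: one-hot vectors of different words always have zero cosine similarity, while made-up dense vectors for "machine", "learning" and "banana" show the kind of distance-based relatedness described above.

import numpy as np

vocab = ["machine", "learning", "banana"]
one_hot = np.eye(len(vocab))  # one dimension per word

# hypothetical 2-dimensional dense embeddings (numbers chosen only for illustration)
dense = {"machine": np.array([0.9, 0.1]),
         "learning": np.array([0.8, 0.2]),
         "banana": np.array([-0.5, 0.7])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(one_hot[0], one_hot[1]))               # 0.0 -- one-hot says nothing about relatedness
print(cosine(dense["machine"], dense["learning"]))  # ~0.99 -- related words are close
print(cosine(dense["machine"], dense["banana"]))    # ~-0.49 -- unrelated words are far apart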

3. How do we compute it?

In fact, an intuitive idea is that, within a sentence, words that appear close to each other generally have higher relatedness.
For example, take the following sentence:

We are working on NLP project, it is interesting

For this sentence, the relatedness between "we" and "are" is clearly higher than that between "we" and "NLP" or other more distant words. This confirms the intuition above and suggests an approach:

1. Skip-Gram Model

The skip-gram algorithm predicts a word's context words (the other words inside a window around the center word) given the target (center) word. Here the window size is 2, i.e. two words on each side, so the context words of "NLP" are: working, on, project, it. A small extraction sketch follows below.
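
To make this concrete, here is a tiny sketch (not part of the post's original code; the sentence and window size are taken from the text above) that extracts the window-2 context of the word "NLP":

sentence = "We are working on NLP project, it is interesting".split()
C = 2  # window size: two words on each side of the center word

for idx, center in enumerate(sentence):
    context = sentence[max(0, idx - C):idx] + sentence[idx + 1:idx + 1 + C]
    if center == "NLP":
        print(center, "->", context)  # ['working', 'on', 'project,', 'it'] (punctuation stays attached with a plain split)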

Consider the following:

We _ working _ NLP project, it is interesting. After training on a corpus, we want to predict which words are likely to appear on either side of "working". (There is also a CBOW model, which we do not cover here; interested readers can consult the paper linked below.)
How do we compute this?
Asking which words are likely to appear around "working" is really asking which words are related to "working". So what counts as high relatedness (i.e. being a neighbor)? It means that, in a given corpus (We are working on NLP project, it is interesting), the word appears to the left or right of "working" with the highest probability, i.e. we maximize P(context | working). We write this conditional probability as a softmax and, under this maximum-likelihood objective, solve for the parameters (u, v), which give the word2vec vectors. The concrete form is shown in the figure below: the conditional probability is converted into a softmax, which is then optimized to obtain the word2vec vectors.

[Figure 2: the skip-gram conditional probability written as a softmax over the parameter matrices U and V]
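
As a textual stand-in for the formula in the figure, the skip-gram objective and its softmax form are commonly written as follows, with $v_c$ the center-word vector (a row of $V$), $u_o$ the context-word vector (a row of $U$), $C$ the window size, $T$ the corpus length, and the denominator summing over the whole vocabulary:

$$
\max_{U,\,V}\ \sum_{t=1}^{T}\ \sum_{-C \le j \le C,\ j \ne 0} \log P(w_{t+j} \mid w_t),
\qquad
P(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w} \exp\left(u_w^{\top} v_c\right)}
$$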

Training as in the figure above yields the U and V matrices, which are the word2vec embeddings.

PyTorch implementation:

import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import torch.utils.data as Data

dtype = torch.FloatTensor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
sentences = ["is a domesticated carnivoran of the family Canidae",
             "It is part of the wolf-like canids and is the most widely abundant terrestrial carnivore",
             "The dog and the extant gray wolf are sister taxa as modern wolves are not closely related to the wolves that were first domesticated which implies that the direct ancestor of the dog is extinct",
             "Their long association with humans has led dogs to be uniquely attuned to human behavior[18] and they are able to thrive on a starch-rich diet that would be inadequate for other canids",
             "Dogs vary widely in shape, size and colors",
             "They perform many roles for humans such as hunting, herding, pulling loads protection assisting police and military companionship and more recently aiding disabled people and therapeutic roles"]

word_sequence = " ".join(sentences).split()
vocab = list(set(word_sequence))  # build words vocabulary

word2idx = {w: i for i, w in enumerate(vocab)}
# Word2Vec Parameters
batch_size = 8
embedding_size = 2  # each word is represented by a 2-dimensional vector
C = 2  # window size: two words on each side of the center word
voc_size = len(vocab)

# 1. Build (center word, context word) skip-gram pairs
skip_grams = []
for idx in range(C, len(word_sequence) - C):
    center = word2idx[word_sequence[idx]]  # center word
    context_idx = list(range(idx - C, idx)) + list(range(idx + 1, idx + C + 1))  # context word idx
    context = [word2idx[word_sequence[i]] for i in context_idx]
    for w in context:
        skip_grams.append([center, w])


# 2. Convert pairs into one-hot input vectors and target word indices
def make_data(skip_grams):
    input_data = []
    output_data = []
    for i in range(len(skip_grams)):
        input_data.append(np.eye(voc_size)[skip_grams[i][0]])
        output_data.append(skip_grams[i][1])
    return input_data, output_data


# 3. Build the dataset and DataLoader
input_data, output_data = make_data(skip_grams)
input_data, output_data = torch.Tensor(np.array(input_data)), torch.LongTensor(output_data)  # stack one-hot rows first to avoid a slow list-of-arrays conversion
dataset = Data.TensorDataset(input_data, output_data)
loader = Data.DataLoader(dataset, batch_size, True)


# Model
class Word2Vec(nn.Module):
    def __init__(self):
        super(Word2Vec, self).__init__()

        # W and V are separate parameters; V is not simply the transpose of W
        self.W = nn.Parameter(torch.randn(voc_size, embedding_size).type(dtype))
        self.V = nn.Parameter(torch.randn(embedding_size, voc_size).type(dtype))

    def forward(self, X):
        # X : [batch_size, voc_size] one-hot
        # torch.mm only works on 2-D matrices, while torch.matmul handles arbitrary dimensions
        hidden_layer = torch.matmul(X, self.W)  # hidden_layer : [batch_size, embedding_size]
        output_layer = torch.matmul(hidden_layer, self.V)  # output_layer : [batch_size, voc_size]
        return output_layer


model = Word2Vec().to(device)
criterion = nn.CrossEntropyLoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Training
for epoch in range(2000):
    for i, (batch_x, batch_y) in enumerate(loader):
        batch_x = batch_x.to(device)
        batch_y = batch_y.to(device)
        pred = model(batch_x)
        loss = criterion(pred, batch_y)
        if (epoch + 1) % 1000 == 0:
            print(epoch + 1, i, loss.item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

for i, label in enumerate(vocab):
    W, WT = model.parameters()
    x, y = float(W[i][0]), float(W[i][1])
    plt.scatter(x, y)
    plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points', ha='right', va='bottom')
plt.show()
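
After training, each row of W is a learned 2-dimensional word vector. A word's embedding can be read off like this (a small sketch reusing model, vocab and word2idx defined above, assuming the training loop has finished):

word = vocab[0]                    # any word from the vocabulary
vec = model.W[word2idx[word]]      # its row of W is the embedding, shape: [embedding_size]
print(word, vec.detach().cpu().numpy())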

Paper:
The paper explains the two basic models, CBOW and Skip-Gram, including the details of their derivations, and then introduces the optimization techniques of hierarchical softmax and negative sampling. It is an excellent resource for understanding word2vec.
word2vec Parameter Learning Explained

Translated version (in Chinese):

https://zhuanlan.zhihu.com/p/52787964
