[Word2Vec Series] Negative Sampling

http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

Intro

The original skip-gram model learns vectors for individual words only; it cannot produce vectors for phrases.
Adding sub-sampling of frequent words also speeds up training.
we present a simplified variant of Noise Contrastive Estimation (NCE) for training the Skip-gram model that results in faster training and better vector representations for frequent words, compared to more complex hierarchical softmax that was used in the prior work

Skip-gram model

(Figure: Skip-gram model architecture)

Formula:

\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t)

w_t is the center word and 2c is the window size (c context words on each side).

The probability p(w_{t+j} \mid w_t) of a context word given the center word is computed with a softmax:


p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^\top v_{w_I})}{\sum_{w=1}^{W} \exp({v'_w}^\top v_{w_I})}
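
To see why this is expensive, here is a minimal numpy sketch, assuming the input ("center") vectors and output ("context") vectors are stored as rows of two matrices; the names and shapes are illustrative, not from any released word2vec code. The denominator needs a dot product with every one of the W vocabulary words, which is exactly the cost that hierarchical softmax and negative sampling avoid.

```python
import numpy as np

# Minimal sketch of the full softmax p(w_O | w_I).  Assumed layout (illustrative):
#   V_in  -- (W, d) matrix of input vectors  v_w
#   V_out -- (W, d) matrix of output vectors v'_w
def full_softmax_prob(V_in, V_out, center_idx, context_idx):
    scores = V_out @ V_in[center_idx]      # one dot product per vocabulary word: O(W * d)
    scores -= scores.max()                 # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[context_idx] / exp_scores.sum()

W, d = 10000, 100
rng = np.random.default_rng(0)
V_in = rng.normal(scale=0.01, size=(W, d))
V_out = rng.normal(scale=0.01, size=(W, d))
print(full_softmax_prob(V_in, V_out, center_idx=3, context_idx=7))  # ~1/W for random vectors
```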

Hierarchical Softmax

A binary tree is built over the vocabulary to reduce the amount of computation: each prediction only touches the nodes on one root-to-leaf path, roughly log2(W) of them, instead of all W output words.

p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\big([[\, n(w, j+1) = \mathrm{ch}(n(w, j)) \,]] \cdot {v'_{n(w,j)}}^\top v_{w_I}\big)

n(w, j) is the j-th node on the path from the root to w
L(w) is the length of that path
so n(w, 1) = root and n(w, L(w)) = w
ch(n) is an arbitrary but fixed child of n (say, always the left child)
[[x]] is 1 if x is true and -1 otherwise (the paper uses -1, not 0, so the probabilities of a node's two children sum to 1)
\sigma is the sigmoid function
A small sketch of evaluating this product follows below.
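
A minimal sketch of evaluating the product above, assuming the root-to-w path has been precomputed as a list of inner-node indices together with the ±1 signs of the [[·]] terms; the variable names are illustrative, not from the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal sketch of the hierarchical-softmax probability p(w | w_I).  Assumed
# (illustrative) inputs:
#   path_nodes -- indices of the inner nodes n(w,1), ..., n(w,L(w)-1)
#   path_signs -- +1 if n(w,j+1) = ch(n(w,j)), else -1  (the [[.]] term)
#   V_in       -- input vectors v_w;  V_node -- inner-node vectors v'_n
def hierarchical_softmax_prob(V_in, V_node, center_idx, path_nodes, path_signs):
    p = 1.0
    for node, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * np.dot(V_node[node], V_in[center_idx]))
    return p  # only L(w)-1, roughly log2(W), sigmoids instead of a W-term softmax
```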

Negative Sampling

Objective function

\log \sigma({v'_{w_O}}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \big[\log \sigma(-{v'_{w_i}}^\top v_{w_I})\big]

The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples. And while NCE approximately maximizes the log probability of the softmax, this property is not important for our application.

I don't fully understand this part yet; a sketch of the sampled objective follows below.
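
To make the quoted point concrete: negative sampling only needs k words drawn from the noise distribution P_n(w) (the paper reports the unigram distribution raised to the 3/4 power works best), never their numerical probabilities under it. Below is a minimal numpy sketch of the per-pair objective; all names and shapes are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal sketch of the negative-sampling objective for one (center, context) pair.
# Assumed (illustrative) inputs:
#   V_in / V_out -- input and output vector matrices, shape (W, d)
#   noise_probs  -- the noise distribution P_n(w) over the vocabulary, shape (W,)
#   k            -- number of negative samples per positive pair
def neg_sampling_objective(V_in, V_out, center_idx, context_idx, noise_probs, k, rng):
    v_c = V_in[center_idx]
    # positive term: push the true context word towards the center word
    obj = np.log(sigmoid(np.dot(V_out[context_idx], v_c)))
    # negative terms: only the k sampled words are touched, not the whole vocabulary
    negatives = rng.choice(len(noise_probs), size=k, p=noise_probs)
    obj += np.sum(np.log(sigmoid(-V_out[negatives] @ v_c)))
    return obj
```

Maximizing this term in place of \log p(w_O \mid w_I) in the skip-gram objective makes each training pair cost O(k \cdot d) instead of O(W \cdot d).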

Subsampling of frequent words

Words like "the" and "a" are very frequent but carry little information, and we expect their vectors to change little once training has seen them many times.
Each occurrence of a word in the training set is therefore discarded with probability P(w_i):

P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}

f(w_i) is the word's frequency and t is a threshold, typically around 10^{-5}.

We chose this subsampling formula because it aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies
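
A minimal sketch of applying the discard rule to a token stream, assuming the frequencies f(w_i) have been precomputed as fractions of the corpus; the function and names are illustrative.

```python
import numpy as np

# Minimal sketch of frequent-word subsampling: each occurrence of w is discarded
# with probability P(w) = 1 - sqrt(t / f(w)).  `freq` maps word -> corpus frequency
# (a fraction of total tokens); names are illustrative.
def subsample(tokens, freq, t=1e-5, rng=None):
    rng = rng or np.random.default_rng()
    kept = []
    for w in tokens:
        p_discard = max(0.0, 1.0 - np.sqrt(t / freq[w]))  # <= 0 for rare words, so they are always kept
        if rng.random() >= p_discard:
            kept.append(w)
    return kept
```

Words with f(w_i) \le t are never discarded, while the most frequent words are dropped aggressively, matching the quote above.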

Phrases made of multiple words are found with the following score.
Running the procedure for 2-4 passes over the data lets longer phrases form.

\mathrm{score}(w_i, w_j) = \frac{\mathrm{count}(w_i w_j) - \delta}{\mathrm{count}(w_i) \times \mathrm{count}(w_j)}

The discounting coefficient \delta prevents two very infrequent words from being joined into a phrase.
The score is computed over bigrams; bigrams scoring above a chosen threshold are merged into a single token.
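
A minimal sketch of one pass of this phrase detection, merging adjacent word pairs whose score exceeds a threshold; the \delta and threshold values here are illustrative, not the paper's settings.

```python
from collections import Counter

# Minimal sketch of one phrase-detection pass using
#   score(w1, w2) = (count(w1 w2) - delta) / (count(w1) * count(w2)).
# `delta` and `threshold` are illustrative values.
def merge_phrases(tokens, delta=5.0, threshold=1e-4):
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            w1, w2 = tokens[i], tokens[i + 1]
            score = (bigram[(w1, w2)] - delta) / (unigram[w1] * unigram[w2])
            if score > threshold:
                merged.append(w1 + "_" + w2)  # join the pair into a single token
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged
```

Running the pass 2-4 times, as noted above, lets already-merged bigrams combine again into longer phrases.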
