Reading notes on "Distributed Representations of Words and Phrases and their Compositionality"

1. Contributions of the Paper

The paper extends the existing Skip-gram model to improve both the quality of the word vectors and the training speed.

The authors' main innovations:

  • Subsampling of frequent words, which speeds up training and improves the quality of the learned word vectors.
  • Negative sampling, proposed as an alternative to hierarchical softmax.
2. Main Contributions of Prior Work
  • Mikolov et al. proposed the Skip-gram model, which learns high-quality word vectors quickly: unlike earlier neural-network approaches to learning word vectors, it involves no dense matrix multiplications. The learned vectors also capture linguistic patterns as linear relationships: "We found that simple vector addition can often produce meaningful results. This compositionality suggests that a non-obvious degree of language understanding can be obtained by using basic mathematical operations on the word vector representations."
  • Recursive Autoencoders use phrase vectors instead of word vectors, representing the meaning of a sentence by composing the vectors of its words.
  • An alternative to the hierarchical softmax is Noise Contrastive Estimation (NCE), which was introduced by Gutmann and Hyvärinen and applied to language modeling by Mnih and Teh. NCE posits that a good model should be able to differentiate data from noise by means of logistic regression.
3. Problems Left Open by Existing Methods
  • Word vectors do not represent certain phrases and idiomatic expressions well; the meaning of such phrases cannot be derived by composing the vectors of their constituent words.
4. Preliminaries
The Skip-gram model

Purpose of the model: "The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document."
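
For reference, the training objective of the Skip-gram model for a corpus of words w_1, ..., w_T with context window size c, together with the basic softmax definition of p(w_O | w_I) (v_w and v'_w are the "input" and "output" vector representations of word w, and W is the vocabulary size):

    \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \ne 0} \log p(w_{t+j} \mid w_t),
    \qquad
    p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}

Computing this full softmax is impractical because its cost is proportional to W, which is what motivates hierarchical softmax and negative sampling.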

Hierarchical Softmax

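Hierarchical softmax replaces the flat softmax over the whole vocabulary with a binary tree (the paper uses a binary Huffman tree), so evaluating p(w | w_I) costs roughly log2(W) inner products instead of W. With n(w, j) the j-th node on the path from the root to the leaf w, L(w) the length of that path, ch(n) an arbitrarily fixed child of node n, and [[x]] equal to 1 if x is true and -1 otherwise:

    p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\left( [\![\, n(w, j+1) = \mathrm{ch}(n(w, j)) \,]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right)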

5. Core Algorithms
5.1.Negative Sampling

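The Negative Sampling (NEG) objective from the paper replaces every log p(w_O | w_I) term in the Skip-gram objective. For each observed (input, context) pair, k noise words w_i are drawn from a noise distribution P_n(w); the paper finds the unigram distribution raised to the 3/4 power to work best, with k around 5-20 for small datasets and as small as 2-5 for large ones:

    \log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right]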

Negative Sampling is the authors' simplification of NCE for their specific setting: the Skip-gram model only needs to learn high-quality vector representations, not to approximate the full softmax probabilities, so NCE can be simplified.

The difference between the two: "The main difference between the Negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while Negative sampling uses only samples."
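
A minimal NumPy sketch of the negative-sampling loss for a single (input word, context word) pair; the function and variable names are illustrative (not from the paper), and the noise distribution follows the paper's U(w)^{3/4} choice:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_loss(v_in, v_out, noise_vecs):
        """Loss for one (input, context) pair.
        v_in      : input vector v_{w_I} of the center word
        v_out     : output vector v'_{w_O} of the observed context word
        noise_vecs: k output vectors v'_{w_i} of the sampled noise words
        """
        positive = np.log(sigmoid(v_out @ v_in))
        negative = np.sum(np.log(sigmoid(-noise_vecs @ v_in)))
        return -(positive + negative)  # minimize the negated objective

    def sample_noise(word_counts, k, rng=np.random.default_rng(0)):
        """Draw k word indices from the unigram distribution raised to the 3/4 power."""
        probs = word_counts.astype(float) ** 0.75
        probs /= probs.sum()
        return rng.choice(len(word_counts), size=k, p=probs)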

5.2.Subsampling of Frequent Words

Motivation: "Frequent words usually provide less information value than the rare words."

The vector representations of frequent words do not change significantly after training on several million examples.

The heuristic formula the authors propose:
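
Each word w_i in the training set is discarded with probability

    P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}

where f(w_i) is the frequency of word w_i and t is a chosen threshold, typically around 10^{-5}.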

"We chose this subsampling formula because it aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies. Although this subsampling formula was chosen heuristically, we found it to work well in practice. It accelerates learning and even significantly improves the accuracy." (The formula was chosen heuristically, not derived.)
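
A small sketch of how this subsampling step might be applied to a token stream (names such as `freqs` and `subsample` are illustrative, not from the paper):

    import numpy as np

    def keep_prob(freq, t=1e-5):
        """Keep probability 1 - P(w_i) = sqrt(t / f(w_i)), capped at 1 for rare words."""
        return min(1.0, np.sqrt(t / freq))

    def subsample(tokens, freqs, t=1e-5, rng=np.random.default_rng(0)):
        """Randomly drop occurrences of frequent words.
        tokens: list of word tokens in the training stream
        freqs : dict mapping word -> relative frequency f(w) in the corpus
        """
        return [w for w in tokens if rng.random() < keep_prob(freqs[w], t)]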

6. Experiments

Omitted…

7. Interpretation of the Results
Additive Compositionality

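The paper explains the additive property as follows: because the Skip-gram vectors are trained to predict the surrounding words, each vector encodes the distribution of contexts its word appears in, and the sum of two vectors roughly corresponds to the product (a logical AND) of their context distributions. This is why, for example, vec("Russia") + vec("river") lies close to vec("Volga River"), and vec("Vietnam") + vec("capital") lies close to vec("Hanoi").

A minimal NumPy sketch of the kind of nearest-neighbor query behind these examples; `vocab`, `vectors`, and `idx` are assumed to come from an already-trained model and are not part of the paper:

    import numpy as np

    def nearest(query_vec, vocab, vectors, topn=5):
        """Return the topn vocabulary entries closest to query_vec by cosine similarity."""
        q = query_vec / np.linalg.norm(query_vec)
        unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        sims = unit @ q
        best = np.argsort(-sims)[:topn]
        return [(vocab[i], float(sims[i])) for i in best]

    # Example (hypothetical lookup table idx: word -> row index):
    # nearest(vectors[idx["Vietnam"]] + vectors[idx["capital"]], vocab, vectors)
    # should rank "Hanoi" near the top, per the paper's examples.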

8. Conclusion

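Paraphrasing the paper's conclusions: the Skip-gram model can be trained efficiently on very large corpora; subsampling of frequent words both speeds up training and improves the representations of less frequent words; negative sampling is a simple and fast alternative to hierarchical softmax; treating frequent phrases as single tokens and exploiting additive compositionality make it possible to express longer pieces of text with simple vector operations; and the best hyperparameters (architecture, vector size, subsampling rate, window size) are task-dependent.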
