1. Paper Overview:
Abstract: Proposes a new word-vector learning method, GloVe, which exploits both global statistical information and local context information to learn high-quality word vectors.
Introduction: Matrix factorization and Word2vec each have strengths and weaknesses as ways of learning word vectors; the GloVe model proposed in this paper learns from both kinds of information.
Related Work: Review of prior work, mainly the matrix factorization and Word2vec families of methods.
The GloVe Model: Derivation of GloVe, its connections to other models, and an analysis of its complexity.
Experiments: Empirical evaluation of the GloVe model and analysis of several hyperparameters.
Conclusion: Summary of the paper.
2. Learning Goals:
(1) GloVe
Derivation of the GloVe algorithm
Comparison of GloVe with other models
Complexity analysis of GloVe
(2) Experimental analysis
Word analogy results
Named entity recognition results
Hyperparameter analysis of vector length and window size
Hyperparameter analysis of corpus size
Comparison experiments against Word2vec
(3) Code implementation
3. Detailed Walkthrough of the Paper
(0) Abstract
Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
Current word-vector models can capture fine-grained syntactic and semantic regularities between words through vector arithmetic, but the origin of these regularities remains unclear.
Through careful analysis we identify the model properties needed for such regularities to emerge, and based on this we propose a new log-bilinear regression model that combines the advantages of global matrix factorization and local context window methods for learning word vectors.
Our model trains efficiently by training only on the nonzero entries of the word-word co-occurrence matrix.
The model achieves 75% accuracy on the word analogy task and state-of-the-art results on several other tasks.
(1) Introduction
Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al., 2010), and parsing (Socher et al., 2013).
Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of such a set of word representations. Recently, Mikolov et al. (2013c) introduced a new evaluation scheme based on word analogies that probes the finer structure of the word vector space by examining not the scalar distance between word vectors, but rather their various dimensions of difference. For example, the analogy “king is to queen as man is to woman” should be encoded in the vector space by the vector equation king − queen = man − woman. This evaluation scheme favors models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations (Bengio, 2009).
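A minimal sketch (not from the paper) of how this analogy evaluation is typically carried out: the missing word is taken to be the one whose vector is closest, by cosine similarity, to vec(b) - vec(a) + vec(c). The helper name solve_analogy, the vocabulary dictionary, and the usage lines are illustrative assumptions.

    import numpy as np

    def solve_analogy(a, b, c, vectors, vocab):
        """Return the word d such that 'a is to b as c is to d', i.e. the word
        whose vector is closest (by cosine) to vec(b) - vec(a) + vec(c)."""
        target = vectors[vocab[b]] - vectors[vocab[a]] + vectors[vocab[c]]
        target /= np.linalg.norm(target)
        normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        scores = normed @ target                 # cosine similarity with every word
        for w in (a, b, c):                      # exclude the query words themselves
            scores[vocab[w]] = -np.inf
        best = int(np.argmax(scores))
        return next(w for w, i in vocab.items() if i == best)

    # hypothetical usage with pretrained vectors of shape (V, d):
    # vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3, ...}
    # solve_analogy("man", "king", "woman", vectors, vocab)  # expected: "queen"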
The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990) and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c). Currently, both families suffer significant drawbacks. While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.
In this work, we analyze the model properties necessary to produce linear directions of meaning and argue that global log-bilinear regression models are appropriate for doing so. We propose a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy dataset.
We also demonstrate that our methods outperform other current methods on several word similarity tasks, and also on a common named entity recognition (NER) benchmark. We provide the source code for the model as well as trained word vectors at http://nlp.stanford.edu/projects/glove/.
1. Matrix factorization methods
Built on the word co-occurrence matrix; drawback: they perform particularly poorly on the word analogy task.
2. Context-window-based methods
word2vec; drawback: it cannot exploit the global co-occurrence statistics of the corpus.
(2) Related Work
Matrix Factorization Methods. Matrix factorization methods for generating low-dimensional word representations have roots stretching as far back as LSA. These methods utilize low-rank approximations to decompose large matrices that capture statistical information about a corpus. The particular type of information captured by such matrices varies by application. In LSA, the matrices are of “term-document” type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus. In contrast, the Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), for example, utilizes matrices of “term-term” type, i.e., the rows and columns correspond to words and the entries correspond to the number of times a given word occurs in the context of another given word.
A main problem with HAL and related methods is that the most frequent words contribute a disproportionate amount to the similarity measure: the number of times two words co-occur with the or and, for example, will have a large effect on their similarity despite conveying relatively little about their semantic relatedness. A number of techniques exist that addresses this shortcoming of HAL, such as the COALS method (Rohde et al., 2006), in which the co-occurrence matrix is first transformed by an entropy- or correlation-based normalization. An advantage of this type of transformation is that the raw co-occurrence counts, which for a reasonably sized corpus might span 8 or 9 orders of magnitude, are compressed so as to be distributed more evenly in a smaller interval. A variety of newer models also pursue this approach, including a study (Bullinaria and Levy, 2007) that indicates that positive pointwise mutual information (PPMI) is a good transformation. More recently, a square root type transformation in the form of Hellinger PCA (HPCA) (Lebret and Collobert, 2014) has been suggested as an effective way of learning word representations.
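As a concrete illustration of the "term-term" matrices and the PPMI transformation discussed above, here is a small sketch on a toy corpus. The toy sentence, the window size, and the function names cooccurrence_counts and ppmi are illustrative assumptions, not tied to any specific HAL or COALS implementation.

    import numpy as np

    def cooccurrence_counts(tokens, window=2):
        """Symmetric term-term co-occurrence counts: how often each word appears
        within `window` positions of another word."""
        vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
        X = np.zeros((len(vocab), len(vocab)))
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    X[vocab[w], vocab[tokens[j]]] += 1
        return X, vocab

    def ppmi(X):
        """Positive pointwise mutual information transform of a count matrix."""
        p_ij = X / X.sum()
        p_i = p_ij.sum(axis=1, keepdims=True)
        p_j = p_ij.sum(axis=0, keepdims=True)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_ij / (p_i * p_j))
        pmi[~np.isfinite(pmi)] = 0.0             # zero counts map to 0
        return np.maximum(pmi, 0.0)              # clip negative PMI to 0

    tokens = "the cat sat on the mat the dog sat on the rug".split()
    X, vocab = cooccurrence_counts(tokens, window=2)
    M = ppmi(X)                                  # PPMI-transformed co-occurrence matrix

The PPMI step plays the same role as the entropy- or correlation-based normalizations mentioned above: it keeps very frequent function words such as "the" and "and" from dominating the similarity measure.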
Shallow Window-Based Methods. Another approach is to learn word representations that aid in making predictions within local context windows. For example, Bengio et al. (2003) introduced a model that learns word vector representations as part of a simple neural network architecture for language modeling. Collobert and Weston (2008) decoupled the word vector training from the downstream training objectives, which paved the way for Collobert et al. (2011) to use the full context of a word for learning the word representations, rather than just the preceding context as is the case with language models.
Recently, the importance of the full neural network structure for learning useful word representations has been called into question. The skip-gram and continuous bag-of-words (CBOW) models of Mikolov et al. (2013a) propose a simple single-layer architecture based on the inner product between two word vectors. Mnih and Kavukcuoglu (2013) also proposed closely-related vector log-bilinear models, vLBL and ivLBL, and Levy et al. (2014) proposed explicit word embeddings based on a PPMI metric.
In the skip-gram and ivLBL models, the objective is to predict a word’s context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context. Through evaluation on a word analogy task, these models demonstrated the capacity to learn linguistic patterns as linear relationships between the word vectors.
Unlike the matrix factorization methods, the shallow window-based methods suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus. Instead, these models scan context windows across the entire corpus, which fails to take advantage of the vast amount of repetition in the data.
(3) The GloVe Model
We begin with a simple example that showcases how certain aspects of meaning can be extracted directly from co-occurrence probabilities. Consider two words i and j that exhibit a particular aspect of interest; for concreteness, suppose we are interested in the concept of thermodynamic phase, for which we might take i = ice and j = steam. The relationship of these words can be examined by studying the ratio of their co-occurrence probabilities with various probe words, k. For words k related to ice but not steam, say k = solid, we expect the ratio P_ik/P_jk will be large. Similarly, for words k related to steam but not ice, say k = gas, the ratio should be small. For words k like water or fashion, that are either related to both ice and steam, or to neither, the ratio should be close to one. Table 1 shows these probabilities and their ratios for a large corpus, and the numbers confirm these expectations. Compared to the raw probabilities, the ratio is better able to distinguish relevant words (solid and gas) from irrelevant words (water and fashion) and it is also better able to discriminate between the two relevant words.
Observation: the ratio of co-occurrence probabilities, rather than the raw probabilities themselves, is what best distinguishes relevant words from irrelevant ones.
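From this ratio argument the paper derives its weighted least-squares objective, J = sum_ij f(X_ij) (w_i^T w~_j + b_i + b~_j - log X_ij)^2, with the weighting function f(x) = (x/x_max)^alpha for x < x_max and 1 otherwise (x_max = 100, alpha = 3/4). Below is a minimal NumPy sketch of this objective evaluated over the nonzero entries of the co-occurrence matrix; the random toy data, the names weighting and glove_loss, and the absence of the AdaGrad training loop are illustrative assumptions, not the released implementation.

    import numpy as np

    def weighting(x, x_max=100.0, alpha=0.75):
        """GloVe weighting f(x): down-weights rare co-occurrences, caps frequent ones at 1."""
        return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

    def glove_loss(X, W, W_ctx, b, b_ctx):
        """J = sum over nonzero X_ij of f(X_ij) * (w_i . w_ctx_j + b_i + b_ctx_j - log X_ij)^2"""
        i_idx, j_idx = np.nonzero(X)                       # train only on nonzero counts
        x = X[i_idx, j_idx]
        inner = np.sum(W[i_idx] * W_ctx[j_idx], axis=1)    # dot products w_i . w_ctx_j
        diff = inner + b[i_idx] + b_ctx[j_idx] - np.log(x)
        return np.sum(weighting(x) * diff ** 2)

    # toy example: random counts standing in for a real co-occurrence matrix
    V, d = 10, 5
    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(V, V)).astype(float)
    W, W_ctx = rng.normal(size=(V, d)), rng.normal(size=(V, d))
    b, b_ctx = np.zeros(V), np.zeros(V)
    print(glove_loss(X, W, W_ctx, b, b_ctx))

After training, the paper uses the sum of the word and context vectors (W + W~) as the final word embeddings.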
(4) Experiments
Experimental results and analysis:
In Fig. 2, we show the results of experiments that vary vector length and context window. A context window that extends to the left and right of a target word will be called symmetric, and one which extends only to the left will be called asymmetric. In (a), we observe diminishing returns for vectors larger than about 200 dimensions. In (b) and (c), we examine the effect of varying the window size for symmetric and asymmetric context windows. Performance is better on the syntactic subtask for small and asymmetric context windows, which aligns with the intuition that syntactic information is mostly drawn from the immediate context and can depend strongly on word order. Semantic information, on the other hand, is more frequently non-local, and more of it is captured with larger window sizes.
Achieves the best results on several word similarity tasks.
Effect of vector length: returns diminish beyond roughly 200 dimensions.
Effect of window size: small, asymmetric windows help the syntactic subtask, while larger windows capture more of the non-local semantic information (see the sketch of symmetric vs. asymmetric windows below).
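A minimal sketch of the symmetric/asymmetric window distinction used in this analysis: a symmetric window collects context words on both sides of the target word, an asymmetric one only from its left. The toy sentence, window size, and the generator name context_windows are illustrative.

    def context_windows(tokens, window=3, symmetric=True):
        """Yield (target, context) pairs using a symmetric window (both sides of the
        target word) or an asymmetric window (preceding words only)."""
        for i, target in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window] if symmetric else []
            for context in left + right:
                yield target, context

    tokens = "the quick brown fox jumps over the lazy dog".split()
    sym_pairs = list(context_windows(tokens, window=2, symmetric=True))
    asym_pairs = list(context_windows(tokens, window=2, symmetric=False))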
4.5 Model Analysis: Corpus Size
In Fig. 3, we show performance on the word analogy task for 300-dimensional vectors trained on different corpora. On the syntactic subtask, there is a monotonic increase in performance as the corpus size increases. This is to be expected since larger corpora typically produce better statistics. Interestingly, the same trend is not true for the semantic subtask, where the models trained on the smaller Wikipedia corpora do better than those trained on the larger Gigaword corpus. This is likely due to the large number of city- and country-based analogies in the analogy dataset and the fact that Wikipedia has fairly comprehensive articles for most such locations. Moreover, Wikipedia’s entries are updated to assimilate new knowledge, whereas Gigaword is a fixed news repository with outdated and possibly incorrect information.
Effect of the training corpus: on the syntactic subtask more data is better, while on the semantic subtask corpus quality and coverage (e.g. Wikipedia's up-to-date entity articles) also matter.
Training time: for GloVe the number of iterations determines training time, whereas for word2vec it is the number of negative samples.
For the same amount of training time, GloVe clearly outperforms word2vec.
4. Results and Significance
(1) Achieves state-of-the-art results on the word analogy dataset
(2) Releases a set of pretrained GloVe word vectors
(3) Advances deep-learning-based natural language processing
5. Summary
Key points
Matrix-factorization-based word vector learning
Context-window-based word vector learning
Pretrained word vectors
Innovations
Proposes a new word vector training model, GloVe
Achieves the best results on multiple tasks
Releases a set of pretrained word vectors
Takeaways
Compared with raw probabilities, the ratio of co-occurrence probabilities is better able to distinguish relevant words from irrelevant ones, and can also discriminate between two relevant words.
Proposes a new log-bilinear regression model that combines the advantages of global matrix factorization and local context window methods.