Words that occur in similar contexts tend to have similar meanings. This link between similarity in how words are distributed and similarity in what they mean is called the distributional hypothesis.
Vector semantics instantiates this linguistic hypothesis by learning representations of the meaning of words directly from their distributions in texts. These representations are used in every natural language processing application that makes use of meaning. These word representations are also the first example we will see in the book of representation learning, automatically learning useful representations of the input text. Finding such unsupervised ways to learn representations of the input, instead of creating representations by hand via feature engineering, is an important focus of recent NLP research (Bengio et al., 2013).
A model of word meaning should allow us to draw useful inferences that will help us solve meaning-related tasks like question-answering, summarization, paraphrase or plagiarism detection, and dialogue.
lexical semantics: linguistic study of word meaning
Lemma and Sense
Relations between words or senses
Word Similarity: While words don’t have many synonyms, most words do have lots of similar words. Cat is not a synonym of dog, but cats and dogs are certainly similar words. In moving from synonymy to similarity, it will be useful to shift from talking about relations between word senses (like synonymy) to relations between words (like similarity). Dealing with words avoids having to commit to a particular
representation of word senses, which will turn out to simplify our task.
One way of getting values for word similarity is to ask humans to judge how similar one word is to another. A number of datasets have resulted from such experiments. For example the SimLex-999 dataset (Hill et al., 2015) gives values on a scale from 0 to 10, like the examples below, which range from near-synonyms (vanish, disappear) to pairs that scarcely seem to have anything in common (hole, agreement).
Word Relatedness: The meaning of two words can be related in ways other than similarity. One such class of connections is called word relatedness (Budanitsky and Hirst, 2006), also traditionally called word association in psychology. For example, coffee and cup are related (associated) but not similar.
One common kind of relatedness between words is if they belong to the same semantic field. A semantic field is a set of words which cover a particular semantic domain and bear structured relations with each other.
Semantic fields are also related to topic models, like Latent Dirichlet Allocation (LDA), which apply unsupervised learning on large sets of texts to induce sets of associated words from text. Semantic fields and topic models are a very useful tool for discovering topical structure in documents.
Semantic Frames and Roles: Closely related to semantic fields is the idea of a semantic frame. A semantic frame is a set of words that denote perspectives or participants in a particular type of event (whereas a semantic field groups words that share a domain).
Taxonomic Relations: Another way word senses can be related is taxonomically. A word (or sense) is a hyponym of another word or sense if the first is more specific, denoting a subclass of the other. For example, car is a hyponym of vehicle and dog is a hyponym of animal. Conversely, we say that vehicle is a hypernym of car, and animal is a hypernym of dog. It is unfortunate that the two words (hypernym and hyponym) are very similar and hence easily confused; for this reason, the word superordinate is often used instead of hypernym.
Hypernymy can also be defined in terms of entailment: being an $A$ entails being a $B$, or $\forall x\, A(x) \Rightarrow B(x)$. Another name for the hypernym/hyponym structure is the IS-A hierarchy, in which we say A IS-A B, or B subsumes A.
Connotation: Finally, words have affective meanings or connotations. The word connotation has different meanings in different fields, but here we use it to mean the aspects of a word’s meaning that are related to a writer or reader’s emotions, sentiment, opinions, or evaluations. For example some words have positive connotations (happy) while others have negative connotations (sad). Some words describe
positive evaluation (great, love) and others negative evaluation (terrible, hate). Positive or negative evaluation expressed through language is called sentiment, as we saw in Chapter 4, and word sentiment plays a role in important tasks like sentiment analysis, stance detection, and applications of natural language processing to the language of politics and consumer reviews.
Early work on affective meaning (Osgood et al., 1957) found that words varied along three important dimensions of affective meaning. These are now generally called valence, arousal, and dominance, defined as follows:
valence: the pleasantness of the stimulus
arousal: the intensity of emotion provoked by the stimulus
dominance: the degree of control exerted by the stimulus
Thus words like happy or satisfied are high on valence, while unhappy or annoyed are low on valence. Excited or frenzied are high on arousal, while relaxed or calm are low on arousal. Important or controlling are high on dominance, while awed or influenced are low on dominance. Each word is thus represented by three numbers, corresponding to its value on each of the three dimensions.
Osgood et al. (1957) noticed that in using these 3 numbers to represent the meaning of a word, the model was representing each word as a point in a three-dimensional space, a vector whose three dimensions corresponded to the word's rating on the three scales. This revolutionary idea that word meaning could be represented as a point in space (e.g., that part of the meaning of heartbreak can be represented as the point $[2.45, 5.65, 3.58]$) was the first expression of the vector semantics models that we introduce next.
The idea of vector semantics is to represent a word as a point in some multidimensional semantic space. Vectors for representing words are generally called embeddings, because the word is embedded in a particular vector space.
If words are represented as embeddings, a classifier can generalize: it can assign sentiment to test-set words as long as words with similar meanings occurred in the training set. Vector semantic models are also extremely practical because they can be learned automatically from text without any complex labeling or supervision.
As a result of these advantages, vector models of meaning are now the standard way to represent the meaning of words in NLP. In this chapter we'll introduce the two most commonly used models. First is the tf-idf model, often used as a baseline, in which the meaning of a word is defined by a simple function of the counts of nearby words. We will see that this method results in very long vectors that are sparse, i.e. contain mostly zeros (since most words simply never occur in the context of other words).
Then we’ll introduce the word2vec model, one of a family of models that are ways of constructing short, dense vectors that have useful semantic properties.
We’ll also introduce the cosine, the standard way to use embeddings (vectors) to compute functions like semantic similarity, the similarity between two words, two sentences, or two documents, an important tool in practical applications like question answering, summarization, or automatic essay grading.
Vector or distributional models of meaning are generally based on a co-occurrence matrix, a way of representing how often words co-occur.
In a term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection of documents.
The term-document matrix was first defined as part of the vector space model of information retrieval (Salton, 1971). In this model, a document is represented as a count vector.
In term-document matrices, the vectors representing each document would have dimensionality $|V|$, the vocabulary size.
Term-document matrices were originally defined as a means of finding similar documents for the task of document information retrieval. Two documents that are similar will tend to have similar words, and if two documents have similar words their column vectors will tend to be similar.
Information retrieval (IR) is the task of finding the document $d$ from the $D$ documents in some collection that best matches a query $q$. For IR we'll therefore also represent a query by a vector, also of length $|V|$, and we'll need a way to compare two vectors to find how similar they are. (Doing IR will also require efficient ways to store and manipulate these vectors, which is accomplished by making use of the convenient fact that these vectors are sparse, i.e., mostly zeros).
Rather than the term-document matrix we use the term-term matrix, more commonly called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents. This matrix is thus of dimensionality $|V|\times|V|$ and each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. It is most common to use smaller contexts, generally a window around the word, for example of 4 words to the left and 4 words to the right, in which case the cell represents the number of times (in some training corpus) the column word occurs in such a $\pm 4$ word window around the row word.
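To make the construction concrete, here is a minimal Python/NumPy sketch of building such a word-word matrix from a toy tokenized corpus; the corpus, the window size, and all variable names are illustrative, not from the text.

```python
# Sketch: build a term-term (word-word) co-occurrence matrix from a toy corpus,
# counting context words in a +/-4 token window around each target word.
from collections import defaultdict

import numpy as np

corpus = [
    "sugar a sliced lemon a tablespoonful of apricot jam a pinch".split(),
    "their enjoyment cup and another fruit whose taste she likened".split(),
]

window = 4
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][sentence[j]] += 1

vocab = sorted({w for s in corpus for w in s})
idx = {w: k for k, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)), dtype=int)   # |V| x |V| matrix
for w, ctxs in counts.items():
    for c, n in ctxs.items():
        M[idx[w], idx[c]] = n

print(M[idx["apricot"], idx["jam"]])   # times 'jam' occurred within +/-4 of 'apricot'
```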
By far the most common similarity metric is the cosine of the angle between the vectors. The cosine—like most measures for vector similarity used in NLP—is based on the dot product operator from linear algebra, also called the inner product:
$$\textrm{dot-product}(\vec v,\vec w) = \vec v\cdot \vec w = \sum_{i=1}^N v_iw_i = v_1w_1 + v_2w_2 + \ldots + v_Nw_N$$
The dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions—orthogonal vectors—will have a dot product of 0, representing their strong dissimilarity.
This raw dot-product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as
$$|\vec v| = \sqrt{\sum_{i=1}^N v_i^2}$$
The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. The raw dot product thus will be higher for frequent words. But this is a problem; we’d like a similarity metric that tells us how similar two words are regardless of their frequency.
The simplest way to modify the dot product to normalize for the vector length is to divide the dot product by the lengths of each of the two vectors. The cosine similarity metric between two vectors $\vec v$ and $\vec w$ thus can be computed as:
$$\textrm{cosine}(\vec v,\vec w)=\frac{\vec v\cdot\vec w}{|\vec v||\vec w|}=\frac{\sum_{i=1}^N v_i w_i}{\sqrt{\sum_{i=1}^N v_i^2}\sqrt{\sum_{i=1}^N w_i^2}}$$
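As a quick illustration, here is a minimal NumPy sketch of this computation; the example vectors are made up.

```python
# Minimal sketch: cosine similarity as a length-normalized dot product.
import numpy as np

def cosine(v, w):
    """dot(v, w) / (|v| * |w|)"""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1, 0, 7, 2], dtype=float)
w = np.array([3, 1, 5, 0], dtype=float)
print(cosine(v, w))   # near 1.0 for similar vectors, 0 for orthogonal ones
```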
It's a bit of a paradox. Words that occur nearby frequently (maybe sugar appears often in our corpus near apricot) are more important than words that only appear once or twice. Yet words that are too frequent (ubiquitous words like the or good) are unimportant. How can we balance these two conflicting constraints?
The tf-idf algorithm is the product of two terms, each capturing one of these two intuitions:
The first is the term frequency (Luhn, 1957): the frequency of the word in the document. Normally we want to downweight the raw frequency a bit, since a word appearing 100 times in a document doesn't make that word 100 times more likely to be relevant to the meaning of the document. So we generally use the $\log_{10}$ of the frequency, resulting in the following definition for the term frequency weight:
$$\textrm{tf}_{t,d} = \begin{cases} 1 + \log_{10}\textrm{count}(t,d) & \text{if } \textrm{count}(t,d) > 0\\ 0 & \text{otherwise}\end{cases}$$
The second factor is used to give a higher weight to words that occur only in a few documents. Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection; terms that occur frequently across the entire collection aren't as helpful. The document frequency $\textrm{df}_t$ of a term $t$ is simply the number of documents it occurs in. By contrast, the collection frequency of a term is the total number of times the word appears in the whole collection in any document.
We assign importance to these more discriminative words via the inverse document frequency or idf term weight (Sparck Jones, 1972). The idf is defined using the fraction $N/\textrm{df}_t$, where $N$ is the total number of documents in the collection, and $\textrm{df}_t$ is the number of documents in which term $t$ occurs. The fewer documents in which a term occurs, the higher this weight. The lowest weight of 1 is assigned to terms that occur in all the documents. It's usually clear what counts as a document: in Shakespeare
we would use a play; when processing a collection of encyclopedia articles like Wikipedia, the document is a Wikipedia page; in processing newspaper articles, the document is a single article. Occasionally your corpus might not have appropriate document divisions and you might need to break up the
corpus into documents yourself for the purposes of computing idf.
Because of the large number of documents in many collections, this measure is usually squashed with a log function. The resulting definition for inverse document frequency (idf) is thus
$$\textrm{idf}_t=\log_{10}\left(\frac{N}{\textrm{df}_t}\right)$$
The tf-idf weighting of the value for word $t$ in document $d$, $w_{t,d}$, thus combines term frequency with idf:
$$w_{t,d} = \textrm{tf}_{t,d}\times \textrm{idf}_t$$
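A small sketch of how these weights could be computed for a toy term-document count matrix follows; the counts and variable names are invented for illustration.

```python
# Sketch of tf-idf weighting: rows are terms, columns are documents.
import numpy as np

counts = np.array([
    [10, 0, 3],    # term 0
    [0,  5, 0],    # term 1
    [2,  2, 2],    # term 2 (occurs in every document, so idf = 0)
], dtype=float)

N = counts.shape[1]                    # number of documents
tf = np.zeros_like(counts)
nz = counts > 0
tf[nz] = 1.0 + np.log10(counts[nz])    # 1 + log10(count) where count > 0, else 0
df = (counts > 0).sum(axis=1)          # document frequency of each term
idf = np.log10(N / df)                 # inverse document frequency
w = tf * idf[:, None]                  # tf-idf weight w_{t,d}
print(np.round(w, 3))
```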
In summary, the vector semantics model we've described so far represents a target word as a vector with dimensions corresponding to all the words in the vocabulary (length $|V|$, with vocabularies of 20,000 to 50,000), which is also sparse (most values are zero). The values in each dimension are the frequency with which the target word co-occurs with each neighboring context word, weighted by tf-idf.
The tf-idf vector model can also be used to decide if two documents are similar. We represent a document by taking the vectors of all the words in the document and computing the centroid of all those vectors. The centroid is the multidimensional version of the mean; the centroid of a set of vectors is a single vector that has the minimum sum of squared distances to each of the vectors in the set. Given the $k$ word vectors $w_1, w_2, \ldots, w_k$ of a document, the centroid document vector $d$ is:
$$d = \frac{w_1 + w_2 + \ldots + w_k}{k}$$
Given two documents, we can then compute their document vectors $d_1$ and $d_2$, and estimate the similarity between the two documents by $\cos(d_1,d_2)$.
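A minimal sketch of this centroid-based document similarity, assuming we already have some word vectors (here random stand-ins for tf-idf vectors; the words are illustrative):

```python
# Sketch: represent each document as the centroid (mean) of its word vectors
# and compare two documents with cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
word_vecs = {w: rng.random(50) for w in ["apricot", "jam", "coffee", "cup", "sugar"]}

def doc_vector(words):
    return np.mean([word_vecs[w] for w in words], axis=0)   # centroid = mean vector

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

d1 = doc_vector(["apricot", "jam", "sugar"])
d2 = doc_vector(["coffee", "cup", "sugar"])
print(cosine(d1, d2))
```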
An alternative weighting function to tf-idf is called PPMI (positive pointwise mutual information). PPMI draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in our corpus than we would have a priori expected them to appear by chance.
Pointwise mutual information (Fano, 1961) is one of the most important concepts in NLP. It is a measure of how often two events $x$ and $y$ occur, compared with what we would expect if they were independent:
$$I(x,y) =\log_2 \frac{P(x,y)}{P(x)P(y)}$$
The pointwise mutual information between a target word $w$ and a context word $c$ (Church and Hanks 1989, Church and Hanks 1990) is then defined as:
$$\textrm{PMI}(w,c) = \log_2 \frac{P(w,c)}{P(w)P(c)}$$
The numerator tells us how often we observed the two words together (assuming we compute probability by using the MLE). The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently; recall that the probability of two independent events both occurring is just the product of the probabilities of the two events. Thus, the ratio gives us an estimate of how much more the two words co-occur than we expect by chance. PMI is a useful tool whenever we need to find words that are strongly associated.
PMI values range from negative to positive infinity. But negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous. To distinguish whether two words whose individual probability is each $10^{-6}$ occur together more often than chance, we would need to be certain that the probability of the two occurring together is significantly different than $10^{-12}$, and this kind of granularity would require an enormous corpus. Furthermore it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgments. For this reason it is more common to use Positive PMI (called PPMI) which replaces all negative PMI values with zero (Church and Hanks 1989, Dagan et al. 1993, Niwa and Nitta 1994):
$$\textrm{PPMI}(w,c) = \max\left(\log_2 \frac{P(w,c)}{P(w)P(c)},\,0\right)$$
More formally, let's assume we have a co-occurrence matrix $F$ with $W$ rows (words) and $C$ columns (contexts), where $f_{ij}$ gives the number of times word $w_i$ occurs in context $c_j$. This can be turned into a PPMI matrix where $\textrm{ppmi}_{ij}$ gives the PPMI value of word $w_i$ with context $c_j$ as follows:
$$p_{ij}=\frac{f_{ij}}{\sum_{i=1}^W\sum_{j=1}^C f_{ij}}, \quad p_{i*}=\frac{\sum_{j=1}^C f_{ij}}{\sum_{i=1}^W\sum_{j=1}^C f_{ij}}, \quad p_{*j}=\frac{\sum_{i=1}^W f_{ij}}{\sum_{i=1}^W\sum_{j=1}^C f_{ij}}$$
$$\textrm{PPMI}_{ij}=\max\left(\log_2\frac{p_{ij}}{p_{i*}p_{*j}},\,0\right)$$
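A short sketch of turning a toy count matrix $F$ into a PPMI matrix with these formulas; the counts are invented for illustration.

```python
# Sketch: convert a word-context co-occurrence count matrix F into a PPMI matrix.
import numpy as np

F = np.array([
    [0, 2, 1],
    [3, 0, 0],
    [1, 1, 4],
], dtype=float)                     # f_ij: word i with context j

total = F.sum()
p_ij = F / total                    # joint probabilities
p_i = F.sum(axis=1) / total         # row (word) marginals p_{i*}
p_j = F.sum(axis=0) / total         # column (context) marginals p_{*j}

with np.errstate(divide="ignore"):  # log2(0) gives -inf, clipped to 0 below
    pmi = np.log2(p_ij / np.outer(p_i, p_j))
ppmi = np.maximum(pmi, 0)
print(np.round(ppmi, 3))
```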
PMI has the problem of being biased toward infrequent events; very rare words tend to have very high PMI values. One way to reduce this bias toward low frequency events is to slightly change the computation for $P(c)$, using a different function $P_\alpha(c)$ that raises contexts to the power of $\alpha$:
$$\textrm{PPMI}_\alpha(w,c) = \max\left(\log_2 \frac{P(w,c)}{P(w)P_\alpha(c)},\,0\right)$$
$$P_\alpha(c)=\frac{\textrm{count}(c)^\alpha}{\sum_{c'} \textrm{count}(c')^\alpha}$$
Levy et al. (2015) found that a setting of $\alpha = 0.75$ improved performance of embeddings on a wide range of tasks. This works because raising the probability to $\alpha = 0.75$ increases the probability assigned to rare contexts, and hence lowers their PMI ($P_\alpha(c) > P(c)$ when $c$ is rare).
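A quick numeric check of this effect, with made-up context counts (the words and counts below are hypothetical):

```python
# The rare context gets a relatively larger probability under P_alpha (alpha = 0.75).
counts = {"the": 990, "zygote": 10}        # hypothetical context counts
alpha = 0.75

total = sum(counts.values())
total_a = sum(c ** alpha for c in counts.values())
for word, n in counts.items():
    p = n / total
    p_a = n ** alpha / total_a
    print(word, round(p, 4), round(p_a, 4))   # P_alpha > P for the rare context
```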
Another possible solution is Laplace smoothing: before computing PMI, a small constant $k$ (values of 0.1 to 3 are common) is added to each of the counts, shrinking (discounting) all the non-zero values. The larger the $k$, the more the non-zero counts are discounted.
In this section we introduce one method for learning short, dense vectors: skip-gram with negative sampling, sometimes called SGNS. The skip-gram algorithm is one of two algorithms in a software package called word2vec, and so sometimes the algorithm is loosely referred to as word2vec (Mikolov et al. 2013, Mikolov et al. 2013a). The word2vec methods are fast, efficient to train, and easily available online with code and pretrained embeddings. We point to other embedding methods, like the equally popular GloVe (Pennington et al., 2014), at the end of the chapter.
The intuition of word2vec is that instead of counting how often each word $w$ occurs near, say, apricot, we'll instead train a classifier on a binary prediction task: "Is word $w$ likely to show up near apricot?" We don't actually care about this prediction task; instead we'll take the learned classifier weights as the word embeddings.
The revolutionary intuition here is that we can just use running text as implicitly supervised training data for such a classifier; a word $w$ that occurs near the target word apricot acts as the gold 'correct answer' to the question "Is word $w$ likely to show up near apricot?" This avoids the need for any sort of hand-labeled supervision signal. This idea was first proposed in the task of neural language modeling, when Bengio et al. (2003) and Collobert et al. (2011) showed that a neural language model (a neural network that learned to predict the next word from prior words) could just use the next word in running text as its supervision signal, and could be used to learn an embedding representation for each word as part of doing this prediction task.
We'll see how to do neural networks in the next chapter, but word2vec is a much simpler model than the neural network language model, in two ways. First, word2vec simplifies the task (making it binary classification instead of word prediction). Second, word2vec simplifies the architecture (training a logistic regression classifier instead of a multi-layer neural network with hidden layers that demand more sophisticated training algorithms). The intuition of skip-gram is:
1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative examples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the learned weights as the embeddings.
Let’s start by thinking about the classification task, and then turn to how to train. Imagine a sentence like the following, with a target word apricot and assume we’re using a window of ±2 context words:
... lemon, a [tablespoon of apricot jam, a] pinch ...
(context words c1 = tablespoon, c2 = of; target word t = apricot; context words c3 = jam, c4 = a)
Our goal is to train a classifier such that, given a tuple $(t,c)$ of a target word $t$ paired with a candidate context word $c$ (for example (apricot, jam), or perhaps (apricot, aardvark)) it will return the probability that $c$ is a real context word (true for jam, false for aardvark):
$$P(+|t,c)$$
$$P(-|t,c)=1-P(+|t,c)$$
How does the classifier compute the probability $P$? The intuition of the skip-gram model is to base this probability on similarity: a word is likely to occur near the target if its embedding is similar to the target embedding. How can we compute similarity between embeddings? Recall that two vectors are similar if they have a high dot product (cosine, the most popular similarity metric, is just a normalized dot product). In other words:
$$\textrm{Similarity}(t,c) \approx t \cdot c$$
Of course, the dot product $t\cdot c$ is not a probability, it's just a number ranging from $-\infty$ to $\infty$. (Recall, for that matter, that cosine isn't a probability either.) To turn the dot product into a probability, we'll use the logistic or sigmoid function $\sigma(x)$, the fundamental core of logistic regression.
$$P(+|t,c)=\frac{1}{1+e^{-t\cdot c}}$$
$$P(-|t,c)=1-P(+|t,c)=\frac{e^{-t\cdot c}}{1+e^{-t\cdot c}}$$
We need to take account of multiple context words in the window. Skip-gram makes the strong but very
useful simplifying assumption that all context words are independent, allowing us to just multiply their probabilities:
$$P(+|t,c_{1:k})=\prod_{i=1}^k\frac{1}{1+e^{-t\cdot c_i}}$$
$$\log P(+|t,c_{1:k})=\sum_{i=1}^k\log\frac{1}{1+e^{-t\cdot c_i}}$$
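A minimal sketch of computing this probability with NumPy, using random stand-in embeddings; the dimensionality and names are illustrative assumptions:

```python
# Sketch: P(+ | t, c_1:k) under the independence assumption above,
# using the sigmoid of the dot product for each context word.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 50
t = rng.normal(size=d)                 # target embedding
contexts = rng.normal(size=(4, d))     # c_1 .. c_4 context embeddings

p_pos = np.prod([sigmoid(t @ c) for c in contexts])
log_p_pos = np.sum([np.log(sigmoid(t @ c)) for c in contexts])
print(p_pos, log_p_pos)
```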
In summary, skip-gram trains a probabilistic classifier that, given a test target word $t$ and its context window of $k$ words $c_{1:k}$, assigns a probability based on how similar this context window is to the target word. The probability is based on applying the logistic (sigmoid) function to the dot product of the embeddings of the target word with each context word. We could thus compute this probability if only we had embeddings for each target word and context word in the vocabulary. Let's now turn to learning these embeddings (which is the real goal of training this classifier in the first place).
Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word $w$ to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby.
Let’s start by considering a single piece of the training data, from the sentence above:
... lemon, a [tablespoon of apricot jam, a] pinch ...
(context words c1 = tablespoon, c2 = of; target word t = apricot; context words c3 = jam, c4 = a)
This example has a target word t (apricot), and 4 context words in the $L = \pm 2$ window, resulting in 4 positive training instances:
Positive examples (+):

| t | c |
|---|---|
| apricot | tablespoon |
| apricot | of |
| apricot | jam |
| apricot | a |
For training a binary classifier we also need negative examples, and in fact skip-gram uses more negative examples than positive examples, the ratio set by a parameter $k$. So for each of these $(t,c)$ training instances we'll create $k$ negative samples, each consisting of the target $t$ plus a 'noise word'. A noise word is a random word from the lexicon, constrained not to be the target word $t$. The following table shows the setting where $k = 2$, so we'll have 2 negative examples in the negative training set for each positive example $(t,c)$.
Negative examples (−):

| t | c | t | c |
|---|---|---|---|
| apricot | aardvark | apricot | twelve |
| apricot | puddle | apricot | hello |
| apricot | where | apricot | dear |
| apricot | coaxial | apricot | forever |
The noise words are chosen according to their weighted unigram frequency $P_\alpha(w)$, where $\alpha$ is a weight. If we were sampling according to unweighted frequency $P(w)$, it would mean that with unigram probability $P(\textrm{"the"})$ we would choose the word the as a noise word, with unigram probability $P(\textrm{"aardvark"})$ we would choose aardvark, and so on. But in practice it is common to set $\alpha = 0.75$, i.e. use the weighting $P_{3/4}(w)$:
$$P_\alpha(w)=\frac{\textrm{count}(w)^\alpha}{\sum_{w'} \textrm{count}(w')^\alpha}$$
Setting $\alpha = 0.75$ gives better performance because it gives rare noise words slightly higher probability: for rare words, $P_\alpha(w) > P(w)$.
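A small sketch of drawing noise words from this $P_\alpha$ distribution, with invented unigram counts and $k = 2$; the vocabulary and counts are illustrative:

```python
# Sketch: sample k noise words per positive pair from the alpha-weighted
# unigram distribution, excluding the target word itself.
import numpy as np

rng = np.random.default_rng(2)
unigram_counts = {"the": 1000, "of": 600, "apricot": 5, "aardvark": 2, "jam": 8}
alpha, k = 0.75, 2

words = list(unigram_counts)
weights = np.array([unigram_counts[w] ** alpha for w in words])
p_alpha = weights / weights.sum()

def sample_noise(target, k):
    noise = []
    while len(noise) < k:
        w = rng.choice(words, p=p_alpha)
        if w != target:                 # a noise word may not be the target itself
            noise.append(str(w))
    return noise

print(sample_noise("apricot", k))
```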
Given the set of positive and negative training instances, and an initial set of embeddings, the goal of the learning algorithm is to adjust those embeddings to maximize the similarity of the target word with the context words drawn from the positive examples, and minimize the similarity of the target word with the noise words drawn from the negative examples.
We can express this formally over the whole training set as:
$$L(\theta)=\sum_{(t,c)\in +}\log P(+|t,c)+\sum_{(t,c)\in -}\log P(-|t,c)$$
Or, focusing in on one word/context pair $(t,c)$ with its $k$ noise words $n_1,\ldots, n_k$, the learning objective $L$ is:
$$L(\theta) = \log P(+|t,c) + \sum_{i=1}^{k}\log P(-|t,n_i) = \log\sigma(c\cdot t) + \sum_{i=1}^{k}\log\sigma(-n_i\cdot t)$$
We can then use stochastic gradient descent to train to this objective, iteratively modifying the parameters (the embeddings for each target word t t t and each context word or noise word c c c in the vocabulary) to maximize the objective.
Note that the skip-gram model thus actually learns two separate embeddings for each word $w$: the target embedding $t$ and the context embedding $c$. These embeddings are stored in two matrices, the target matrix $T$ and the context matrix $C$. So each column $i$ of the target matrix $T$ is the $d$-dimensional vector embedding $t_i$ for word $i$ in the vocabulary $V$, and each column $i$ of the context matrix $C$ is the $d$-dimensional vector embedding $c_i$ for word $i$ in $V$. $d$ is the dimension of an embedding and is far less than $|V|$.
Just as in logistic regression, then, the learning algorithm starts with randomly initialized $T$ and $C$ matrices, and then walks through the training corpus using gradient descent to move $T$ and $C$ so as to maximize the objective $L(\theta)$. Thus the matrices $T$ and $C$ function as the parameters $\theta$ that logistic regression is tuning.
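The sketch below shows one such stochastic gradient (ascent) step for a single $(t, c)$ pair and its noise words, with the gradients of the sigmoid-based objective written out by hand; the matrices, learning rate, and word indices are illustrative, and the sketch stores embeddings as rows of $T$ and $C$ purely as a coding convention, not as a claim about the original implementation.

```python
# Sketch: one stochastic-gradient (ascent) step on
#   L = log sigma(c.t) + sum_i log sigma(-n_i.t)
# for a single (t, c) pair with k noise words.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
V, d, eta = 100, 50, 0.05
T = rng.normal(scale=0.1, size=(V, d))   # target embeddings, one row per word (sketch convention)
C = rng.normal(scale=0.1, size=(V, d))   # context embeddings, one row per word

def sgns_step(t_id, c_id, noise_ids):
    t = T[t_id].copy()                   # copy so all updates use the old values
    c = C[c_id]
    noise = C[noise_ids]                 # k noise-word embeddings
    g_pos = 1.0 - sigmoid(c @ t)         # gradient weight for the positive pair
    g_neg = sigmoid(noise @ t)           # one weight per noise word
    T[t_id] += eta * (g_pos * c - g_neg @ noise)   # dL/dt
    C[c_id] += eta * g_pos * t                     # dL/dc
    C[noise_ids] -= eta * np.outer(g_neg, t)       # dL/dn_i

sgns_step(t_id=7, c_id=12, noise_ids=[3, 44])
```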
Once the embeddings are learned, we'll have two embeddings for each word $w_i$: $t_i$ and $c_i$. We can choose to throw away the $C$ matrix and just keep $T$, in which case each word $w_i$ will be represented by the vector $t_i$.
Alternatively we can add the two embeddings together, using the summed embedding $t_i + c_i$ as the new $d$-dimensional embedding, or we can concatenate them into an embedding of dimensionality $2d$.
As with the simple count-based methods like tf-idf, the context window size $L$ affects the performance of skip-gram embeddings, and experiments often tune the parameter $L$ on a dev set. One difference from the count-based methods is that for skip-grams, the larger the window size the more computation the algorithm requires for training (more neighboring words must be predicted).
The simplest way to visualize the meaning of a word $w$ embedded in a space is to list the most similar words to $w$, sorting all words in the vocabulary by their cosine similarity to $w$.
Yet another visualization method is to use a clustering algorithm to show a hierarchical representation of which words are similar to others in the embedding space.
Probably the most common visualization method, however, is to project the many dimensions of a word embedding (for example 100) down into 2 dimensions, using a projection method called t-SNE (van der Maaten and Hinton, 2008).
Vector semantic models have a number of parameters. One parameter that is relevant to both sparse tf-idf vectors and dense word2vec vectors is the size of the context window used to collect counts. This is generally between 1 and 10 words on each side of the target word (for a total context of 2-20 words).
The choice depends on the goals of the representation. Shorter context windows tend to lead to representations that are a bit more syntactic, since the information is coming from immediately nearby words. When the vectors are computed from short context windows, the most similar words to a target word $w$ tend to be semantically similar words with the same parts of speech. When vectors are computed from long context windows, the highest cosine words to a target word $w$ tend to be words that are topically related but not similar.
It’s also often useful to distinguish two kinds of similarity or association between words (Schutze and Pedersen, 1993). Two words have first-order co-occurrence (sometimes called syntagmatic association) if they are typically nearby each other. Thus wrote is a first-order associate of book or poem. Two words have second-order co-occurrence (sometimes called paradigmatic association) if they have similar neighbors. Thus wrote is a second-order associate of words like said or remarked.
Analogy Another semantic property of embeddings is their ability to capture relational meanings. Mikolov et al. (2013b) and Levy and Goldberg (2014b) show that the offsets between vector embeddings can capture some analogical relations between words. For example, the result of the expression vector(‘king’) - vector(‘man’) + vector(‘woman’) is a vector close to vector(‘queen’). Similarly, they found that the expression vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’).
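A sketch of the parallelogram method behind these analogies follows; the embedding dictionary here is a random stand-in for trained vectors, so the expected answer would only emerge with real embeddings.

```python
# Sketch: solve "a is to b as c is to ?" by finding the vocabulary word whose
# embedding is closest (by cosine) to b - a + c, excluding the input words.
import numpy as np

rng = np.random.default_rng(4)
emb = {w: rng.normal(size=50) for w in ["king", "man", "woman", "queen", "apple"]}

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def analogy(a, b, c):
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "king", "woman"))   # with trained embeddings: 'queen'
```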
Embeddings and Historical Semantics: Embeddings can also be a useful tool for studying how meaning changes over time, by computing multiple embedding spaces, each from texts written in a particular time period.
For intrinsic evaluations, the most common metric is to test the performance of embeddings on word similarity, computing the correlation between an algorithm's word similarity scores and word similarity ratings assigned by humans. WordSim-353 (Finkelstein et al., 2002) is a commonly used set of ratings from 0 to 10 for 353 noun pairs; for example (plane, car) had an average score of 5.77. SimLex-999 (Hill et al., 2015) is a more difficult dataset that quantifies similarity (cup, mug) rather than relatedness (cup, coffee), and includes both concrete and abstract adjective, noun, and verb pairs. The TOEFL dataset is a set of 80 questions, each consisting of a target word with 4 additional word choices; the task is to choose which is the correct synonym, as in the example: Levied is closest in meaning to: imposed, believed, requested, correlated (Landauer and Dumais, 1997). All of these datasets present words without context.
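Concretely, this correlation is usually reported as a rank correlation (Spearman's $\rho$) between the two sets of scores; a minimal sketch with invented ratings and model scores:

```python
# Sketch: intrinsic evaluation by correlating model similarity scores with
# human similarity ratings; real evaluations load published datasets instead.
from scipy.stats import spearmanr

human_ratings = [9.8, 7.3, 1.8, 0.3]       # invented ratings for four word pairs
model_scores = [0.91, 0.64, 0.22, 0.15]    # cosine similarities from some embedding model

rho, p_value = spearmanr(human_ratings, model_scores)
print(rho)                                  # rank correlation; higher is better
```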
Slightly more realistic are intrinsic similarity tasks that include context. The Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012) offers a richer evaluation scenario, giving human judgments on 2,003 pairs of words in their sentential context, including nouns, verbs, and adjectives. This dataset enables the evaluation of word similarity algorithms that can make use of context words. The semantic textual similarity task (Agirre et al. 2012, Agirre et al. 2015) evaluates the performance of sentence-level similarity algorithms, consisting of a set of pairs of sentences, each pair with human-labeled similarity scores.
Another task used for evaluation is the analogy task, where the system has to solve problems of the form a is to b as c is to d, given a, b, and c and having to find d. Thus given Athens is to Greece as Oslo is to ___, the system must fill in the word Norway. Or, for a more syntactically-oriented example: given mouse, mice, and dollar the system must return dollars. Large sets of such tuples have been created (Mikolov et al. 2013, Mikolov et al. 2013b).