Chapter 6 Vector Semantics

Reading notes on Speech and Language Processing, 3rd edition (ed3)

Words that occur in similar contexts tend to have similar meanings. This link between similarity in how words are distributed and similarity in what they mean is called the distributional hypothesis.

Vector semantics instantiates this linguistic hypothesis by learning representations of the meaning of words directly from their distributions in texts. These representations are used in every natural language processing application that makes use of meaning. These word representations are also the first example we will see in the book of representation learning, automatically learning useful representations of the input text. Finding such unsupervised ways to learn representations of the input, instead of creating representations by hand via feature engineering, is an important focus of recent NLP research (Bengio et al., 2013).

6.1 Lexical Semantics

A model of word meaning should allow us to draw useful inferences that will help us solve meaning-related tasks like question-answering, summarization, paraphrase or plagiarism detection, and dialogue.

lexical semantics: linguistic study of word meaning

Lemma and Sense

  • lemma: also called the citation form. The basic form of a word, for example the singular form of a noun or the infinitive form of a verb, as it is shown at the beginning of a dictionary entry. The specific forms of a lemma are called wordforms.
  • We call each aspect of the meaning of a lemma a word sense.
  • homonymous: having multiple senses
  • word sense disambiguation: the task of determining which sense of a word is being used in a particular context.

Relations between words or senses

  • synonyms: when one word has a sense whose meaning is identical to a sense of another word, or nearly identical, we say the two senses of those two words are synonyms. A more formal definition of synonymy (between words rather than senses) is that two words are synonymous if they are substitutable one for the other in any sentence without changing the truth conditions of the sentence, the situations in which the sentence would be true. We often say in this case that the two words have the same propositional meaning.
  • principle of contrast is the assumption that a difference in linguistic form is always associated with at least some difference in meaning.
  • antonyms are words with an opposite meaning
  • another group of antonyms, reversives, describe change or movement in opposite directions, such as rise/fall or up/down.
  • Antonyms thus differ completely with respect to one aspect of their meaning—their position on a scale or their direction—but are otherwise very similar, sharing almost all other aspects of meaning. Thus, automatically distinguishing synonyms from antonyms can be difficult.

Word Similarity: While words don’t have many synonyms, most words do have lots of similar words. Cat is not a synonym of dog, but cats and dogs are certainly similar words. In moving from synonymy to similarity, it will be useful to shift from talking about relations between word senses (like synonymy) to relations between words (like similarity). Dealing with words avoids having to commit to a particular representation of word senses, which will turn out to simplify our task.

One way of getting values for word similarity is to ask humans to judge how similar one word is to another. A number of datasets have resulted from such experiments. For example the SimLex-999 dataset (Hill et al., 2015) gives values on a scale from 0 to 10, like the examples below, which range from near-synonyms (vanish, disappear) to pairs that scarcely seem to have anything in common (hole, agreement).

Word Relatedness: The meaning of two words can be related in ways other than similarity. One such class of connections is called word relatedness (Budanitsky and Hirst, 2006), also traditionally called word association in psychology. For example, coffee and cup are related, though not similar.

One common kind of relatedness between words is if they belong to the same semantic field. A semantic field is a set of words which cover a particular semantic domain and bear structured relations with each other.

Semantic fields are also related to topic models, like Latent Dirichlet Allocation, LDA, which apply unsupervised learning on large sets of texts to induce sets of associated words from text. Semantic fields and topic models are a very useful tool for discovering topical structure in documents.

Semantic Frames and Roles: Closely related to semantic fields is the idea of a semantic frame. A semantic frame is a set of words that denote perspectives or participants in a particular type of event (in contrast to a semantic field, which covers a whole domain).

Taxonomic Relations: Another way word senses can be related is taxonomically. A word (or sense) is a hyponym of another word or sense if the first is more specific, denoting a subclass of the other. For example, car is a hyponym of vehicle, and dog is a hyponym of animal. Conversely, we say that vehicle is a hypernym of car, and animal is a hypernym of dog. It is unfortunate that the two words (hypernym and hyponym) are very similar and hence easily confused; for this reason, the word superordinate is often used instead of hypernym.

Hypernymy can also be defined in terms of entailment. Being an $A$ entails being a $B$, or $\forall x\, A(x) \Rightarrow B(x)$. Another name for the hypernym/hyponym structure is the IS-A hierarchy, in which we say A IS-A B, or B subsumes A.

Connotation: Finally, words have affective meanings or connotations. The word connotation has different meanings in different fields, but here we use it to mean the aspects of a word’s meaning that are related to a writer or reader’s emotions, sentiment, opinions, or evaluations. For example some words have positive connotations (happy) while others have negative connotations (sad). Some words describe positive evaluation (great, love) and others negative evaluation (terrible, hate). Positive or negative evaluation expressed through language is called sentiment, as we saw in Chapter 4, and word sentiment plays a role in important tasks like sentiment analysis, stance detection, and applications of natural language processing to the language of politics and consumer reviews.

Early work on affective meaning (Osgood et al., 1957) found that words varied along three important dimensions of affective meaning. These are now generally called valence, arousal, and dominance, defined as follows:

valence: the pleasantness of the stimulus
arousal: the intensity of emotion provoked by the stimulus
dominance: the degree of control exerted by the stimulus

Thus words like happy or satisfied are high on valence, while unhappy or annoyed are low on valence. Excited or frenzied are high on arousal, while relaxed or calm are low on arousal. Important or controlling are high on dominance, while awed or influenced are low on dominance. Each word is thus represented by three numbers, corresponding to its value on each of the three dimensions.

Osgood et al. (1957) noticed that in using these 3 numbers to represent the meaning of a word, the model was representing each word as a point in a three-dimensional space, a vector whose three dimensions corresponded to the word’s rating on the three scales. This revolutionary idea that word meaning could be represented as a point in space (e.g., that part of the meaning of heartbreak can be represented as the point $[2.45, 5.65, 3.58]$) was the first expression of the vector semantics models that we introduce next.

6.2 Vector Semantics

The idea of vector semantics is to represent a word as a point in some multidimensional semantic space. Vectors for representing words are generally called embeddings, because the word is embedded in a particular vector space.

If words were represented as embeddings, we could assign sentiment as long as words with meanings similar to the test set words occurred in the training set. Vector semantic models are also extremely practical because they can be learned automatically from text without any complex labeling or supervision.

As a result of these advantages, vector models of meaning are now the standard way to represent the meaning of words in NLP. In this chapter we’ll introduce the two most commonly used models. First is the tf-idf model, often used as a baseline, in which the meaning of a word is defined by a simple function of the counts of nearby words. We will see that this method results in very long vectors that are sparse, i.e. contain mostly zeros (since most words simply never occur in the context of others).

Then we’ll introduce the word2vec model, one of a family of models that are ways of constructing short, dense vectors that have useful semantic properties.

We’ll also introduce the cosine, the standard way to use embeddings (vectors) to compute functions like semantic similarity, the similarity between two words, two sentences, or two documents, an important tool in practical applications like question answering, summarization, or automatic essay grading.

6.3 Words and Vectors

Vector or distributional models of meaning are generally based on a co-occurrence matrix, a way of representing how often words co-occur.

6.3.1 Vectors and documents

In a term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection of documents.

The term-document matrix was first defined as part of the vector space model of information retrieval (Salton, 1971). In this model, a document is represented as a count vector.

In term-document matrices, the vectors representing each document would have dimensionality $|V|$, the vocabulary size.

Term-document matrices were originally defined as a means of finding similar documents for the task of document information retrieval. Two documents that are similar will tend to have similar words, and if two documents have similar words their column vectors will tend to be similar.

Information retrieval (IR) is the task of finding the document $d$ from the $D$ documents in some collection that best matches a query $q$. For IR we’ll therefore also represent a query by a vector, also of length $|V|$, and we’ll need a way to compare two vectors to find how similar they are. (Doing IR will also require efficient ways to store and manipulate these vectors, which is accomplished by making use of the convenient fact that these vectors are sparse, i.e., mostly zeros).

6.3.2 Words as vectors

Rather than the term-document matrix we use the term-term matrix, more commonly called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents. This matrix is thus of dimensionality $|V| \times |V|$ and each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. It is most common to use smaller contexts, generally a window around the word, for example of 4 words to the left and 4 words to the right, in which case the cell represents the number of times (in some training corpus) the column word occurs in such a $\pm 4$ word window around the row word.
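
To make the counting concrete, here is a minimal Python sketch (my own, not from the book) of building such window-based co-occurrence counts; the toy sentence and the `cooccurrence_counts` helper are invented for illustration.

```python
# Sketch: count co-occurrences of each target word with the words in a
# symmetric +/-4 word window, using only the standard library.
from collections import defaultdict

def cooccurrence_counts(tokens, window=4):
    """Return counts[target][context] over a symmetric window."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts

tokens = "a tablespoon of apricot jam a pinch of salt".split()
counts = cooccurrence_counts(tokens, window=4)
print(counts["apricot"]["jam"])   # 1: 'jam' occurs once within 4 words of 'apricot'
```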

6.4 Cosine for measuring similarity

By far the most common similarity metric is the cosine of the angle between the vectors. The cosine—like most measures for vector similarity used in NLP—is based on the dot product operator from linear algebra, also called the inner product:
$$\textrm{dot-product}(\vec v, \vec w) = \vec v \cdot \vec w = \sum_{i=1}^N v_i w_i = v_1 w_1 + v_2 w_2 + \ldots + v_N w_N$$
The dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions—orthogonal vectors—will have a dot product of 0, representing their strong dissimilarity.

This raw dot-product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as
$$|\vec v| = \sqrt{\sum_{i=1}^N v_i^2}$$
The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. The raw dot product thus will be higher for frequent words. But this is a problem; we’d like a similarity metric that tells us how similar two words are regardless of their frequency.

The simplest way to modify the dot product to normalize for the vector length is to divide the dot product by the lengths of each of the two vectors. The cosine similarity metric between two vectors $\vec v$ and $\vec w$ thus can be computed as:
$$\textrm{cosine}(\vec v,\vec w) = \frac{\vec v \cdot \vec w}{|\vec v|\,|\vec w|} = \frac{\sum_{i=1}^N v_i w_i}{\sqrt{\sum_{i=1}^N v_i^2}\,\sqrt{\sum_{i=1}^N w_i^2}}$$
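
As a quick illustration (not from the book), here is a direct transcription of the cosine formula in Python, with made-up toy vectors:

```python
# Sketch: cosine similarity as the normalized dot product defined above.
import math

def cosine(v, w):
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

v = [1.0, 2.0, 3.0]
w = [2.0, 4.0, 6.0]
u = [3.0, 0.0, 0.0]
print(cosine(v, w))   # 1.0: parallel vectors, same direction despite different lengths
print(cosine(v, u))   # much lower: the vectors point in different directions
```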

6.5 TF-IDF: Weighing terms in the vector

It’s a bit of a paradox. Words that occur frequently nearby (maybe sugar appears often in our corpus near apricot) are more important than words that only appear once or twice. Yet words that are too frequent—ubiquitous, like the or good—are unimportant. How can we balance these two conflicting constraints?

The tf-idf weighting is the product of two terms, each term capturing one of these two intuitions:

  1. The first is the term frequency (Luhn, 1957): the frequency of the word in the document. Normally we want to downweight the raw frequency a bit, since a word appearing 100 times in a document doesn’t make that word 100 times more likely to be relevant to the meaning of the document. So we generally use the $\log_{10}$ of the frequency, resulting in the following definition for the term frequency weight:
     $$\textrm{tf}_{t,d} = \begin{cases} 1 + \log_{10}\textrm{count}(t,d) & \textrm{if count}(t,d) > 0\\ 0 & \textrm{otherwise}\end{cases}$$

  2. The second factor is used to give a higher weight to words that occur only in a few documents. Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection; terms that occur frequently across the entire collection aren’t as helpful. The document frequency $\textrm{df}_t$ of a term $t$ is simply the number of documents it occurs in. By contrast, the collection frequency of a term is the total number of times the word appears in the whole collection in any document.

    We assign importance to these more discriminative words via the inverse document frequency or idf term weight (Sparck Jones, 1972). The idf is defined using the fraction $N/\textrm{df}_t$, where $N$ is the total number of documents in the collection, and $\textrm{df}_t$ is the number of documents in which term $t$ occurs. The fewer documents in which a term occurs, the higher this weight. The lowest weight of 1 is assigned to terms that occur in all the documents. It’s usually clear what counts as a document: in Shakespeare we would use a play; when processing a collection of encyclopedia articles like Wikipedia, the document is a Wikipedia page; in processing newspaper articles, the document is a single article. Occasionally your corpus might not have appropriate document divisions and you might need to break up the corpus into documents yourself for the purposes of computing idf.

    Because of the large number of documents in many collections, this measure is usually squashed with a log function. The resulting definition for inverse document frequency (idf) is thus
     $$\textrm{idf}_t = \log_{10}\left(\frac{N}{\textrm{df}_t}\right)$$

The tf-idf weight $w_{t,d}$ for word $t$ in document $d$ thus combines term frequency with idf:
$$w_{t,d} = \textrm{tf}_{t,d} \times \textrm{idf}_t$$
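
Putting the two factors together, here is a small sketch (mine, with made-up counts) of tf-idf weighting for a term-document count table stored as a dict of raw counts per document:

```python
# Sketch: tf-idf weighting of raw term-document counts.
import math

def tf_idf(term_doc_counts):
    """term_doc_counts: dict term -> list of raw counts, one per document."""
    n_docs = len(next(iter(term_doc_counts.values())))
    weighted = {}
    for term, counts in term_doc_counts.items():
        df = sum(1 for c in counts if c > 0)              # document frequency df_t
        idf = math.log10(n_docs / df) if df else 0.0      # idf_t = log10(N / df_t)
        weighted[term] = [(1 + math.log10(c)) * idf if c > 0 else 0.0  # tf times idf
                          for c in counts]
    return weighted

counts = {"wit": [2, 0, 1, 1], "battle": [1, 1, 0, 0], "good": [14, 80, 62, 89]}
weights = tf_idf(counts)
print(weights["good"])    # all zeros: 'good' appears in every document, so idf = 0
print(weights["battle"])  # nonzero only in the documents where it occurs
```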

6.6 Applications of the tf-idf vector model

In summary, the vector semantics model we’ve described so far represents a target word as a vector with dimensions corresponding to all the words in the vocabulary (length $|V|$, with vocabularies of 20,000 to 50,000), which is also sparse (most values are zero). The values in each dimension are the frequency with which the target word co-occurs with each neighboring context word, weighted by tf-idf.

The tf-idf vector model can also be used to decide if two documents are similar. We represent a document by taking the vectors of all the words in the document and computing the centroid of all those vectors. The centroid is the multidimensional version of the mean; the centroid of a set of vectors is a single vector that has the minimum sum of squared distances to each of the vectors in the set. Given $k$ word vectors $w_1, w_2, \ldots, w_k$, the centroid document vector $d$ is:
$$d = \frac{w_1 + w_2 + \ldots + w_k}{k}$$
Given two documents, we can then compute their document vectors $d_1$ and $d_2$, and estimate the similarity between the two documents by $\cos(d_1, d_2)$.
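
A short sketch of this centroid-based document comparison (my own; the word vectors are hypothetical stand-ins for real tf-idf vectors):

```python
# Sketch: represent each document by the centroid (mean) of its word vectors,
# then compare documents with cosine.
import math

def centroid(vectors):
    k, dim = len(vectors), len(vectors[0])
    return [sum(v[i] for v in vectors) / k for i in range(dim)]

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / (math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in w)))

doc1_words = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]   # hypothetical tf-idf word vectors
doc2_words = [[0.2, 0.7, 0.1], [0.0, 1.0, 0.2]]
d1, d2 = centroid(doc1_words), centroid(doc2_words)
print(cosine(d1, d2))   # document similarity estimate
```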

6.7 Optional: Pointwise Mutual Information (PMI)

An alternative weighting function to tf-idf is called PPMI (positive pointwise mutual information). PPMI draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in our corpus than we would have a priori expected them to appear by chance.

Pointwise mutual information (Fano, 1961) is one of the most important concepts in NLP. It is a measure of how often two events $x$ and $y$ occur, compared with what we would expect if they were independent:
$$I(x,y) = \log_2 \frac{P(x,y)}{P(x)P(y)}$$
The pointwise mutual information between a target word $w$ and a context word $c$ (Church and Hanks 1989, Church and Hanks 1990) is then defined as:
$$\textrm{PMI}(w,c) = \log_2 \frac{P(w,c)}{P(w)P(c)}$$
The numerator tells us how often we observed the two words together (assuming we compute probability by using the MLE). The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently; recall that the probability of two independent events both occurring is just the product of the probabilities of the two events. Thus, the ratio gives us an estimate of how much more the two words co-occur than we expect by chance. PMI is a useful tool whenever we need to find words that are strongly associated.

PMI values range from negative to positive infinity. But negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous. To distinguish whether two words whose individual probability is each $10^{-6}$ occur together more often than chance, we would need to be certain that the probability of the two occurring together is significantly different than $10^{-12}$, and this kind of granularity would require an enormous corpus. Furthermore it’s not clear whether it’s even possible to evaluate such scores of ‘unrelatedness’ with human judgments. For this reason it is more common to use Positive PMI (called PPMI) which replaces all negative PMI values with zero (Church and Hanks 1989, Dagan et al. 1993, Niwa and Nitta 1994):
$$\textrm{PPMI}(w,c) = \max\left(\log_2 \frac{P(w,c)}{P(w)P(c)},\ 0\right)$$
More formally, let’s assume we have a co-occurrence matrix $F$ with $W$ rows (words) and $C$ columns (contexts), where $f_{ij}$ gives the number of times word $w_i$ occurs in context $c_j$. This can be turned into a PPMI matrix where $\textrm{PPMI}_{ij}$ gives the PPMI value of word $w_i$ with context $c_j$ as follows:
$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}, \quad p_{i*} = \frac{\sum_{j=1}^C f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}, \quad p_{*j} = \frac{\sum_{i=1}^W f_{ij}}{\sum_{i=1}^W \sum_{j=1}^C f_{ij}}$$

$$\textrm{PPMI}_{ij} = \max\left(\log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}},\ 0\right)$$
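
The same computation in a short numpy sketch (mine; the count matrix is a toy example):

```python
# Sketch: turn a co-occurrence count matrix F (W words x C contexts) into a PPMI matrix.
import numpy as np

def ppmi(F):
    F = np.asarray(F, dtype=float)
    p_ij = F / F.sum()                          # joint probabilities p_ij
    p_i = p_ij.sum(axis=1, keepdims=True)       # word marginals p_i*
    p_j = p_ij.sum(axis=0, keepdims=True)       # context marginals p_*j
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0                # zero-count cells get 0, not -inf
    return np.maximum(pmi, 0.0)                 # clip negative PMI values to 0

F = [[0, 0, 1, 0],      # toy counts: rows are words, columns are contexts
     [0, 0, 1, 0],
     [2, 8, 1, 4],
     [1, 6, 0, 1]]
print(np.round(ppmi(F), 2))
```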

PMI has the problem of being biased toward infrequent events; very rare words tend to have very high PMI values. One way to reduce this bias toward low-frequency events is to slightly change the computation for $P(c)$, using a different function $P_\alpha(c)$ that raises contexts to the power of $\alpha$:
$$\textrm{PPMI}_\alpha(w,c) = \max\left(\log_2 \frac{P(w,c)}{P(w)P_\alpha(c)},\ 0\right)$$

$$P_\alpha(c) = \frac{\textrm{count}(c)^\alpha}{\sum_c \textrm{count}(c)^\alpha}$$

Levy et al. (2015) found that a setting of $\alpha = 0.75$ improved performance of embeddings on a wide range of tasks. This works because raising the probability to the power of $\alpha = 0.75$ increases the probability assigned to rare contexts, and hence lowers their PMI ($P_\alpha(c) > P(c)$ when $c$ is rare).
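
A tiny sketch of the smoothed context distribution $P_\alpha(c)$, showing how $\alpha = 0.75$ shifts probability mass toward rarer contexts (the counts are invented):

```python
# Sketch: alpha-weighted context probabilities P_alpha(c).
import numpy as np

def context_probs(counts, alpha=0.75):
    weighted = np.asarray(counts, dtype=float) ** alpha
    return weighted / weighted.sum()

counts = [1000, 10, 1]                     # a frequent, a rare, and a very rare context
print(context_probs(counts, alpha=1.0))    # raw relative frequencies
print(context_probs(counts, alpha=0.75))   # the rare contexts get a slightly larger share
```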

Another possible solution is Laplace smoothing: before computing PMI, a small constant $k$ (values of 0.1 to 3 are common) is added to each of the counts, shrinking (discounting) all the non-zero values. The larger the $k$, the more the non-zero counts are discounted.

6.8 Word2vec

In this section we introduce one method for computing very dense, short vectors: skip-gram with negative sampling, sometimes called SGNS. The skip-gram algorithm is one of two algorithms in a software package called word2vec, and so sometimes the algorithm is loosely referred to as word2vec (Mikolov et al. 2013, Mikolov et al. 2013a). The word2vec methods are fast, efficient to train, and easily available online with code and pretrained embeddings. We point to other embedding methods, like the equally popular GloVe (Pennington et al., 2014), at the end of the chapter.

The intuition of word2vec is that instead of counting how often each word $w$ occurs near, say, apricot, we’ll instead train a classifier on a binary prediction task: “Is word $w$ likely to show up near apricot?” We don’t actually care about this prediction task; instead we’ll take the learned classifier weights as the word embeddings.

The revolutionary intuition here is that we can just use running text as implicitly supervised training data for such a classifier; a word $w$ that occurs near the target word apricot acts as the gold ‘correct answer’ to the question “Is word $w$ likely to show up near apricot?” This avoids the need for any sort of hand-labeled supervision signal. This idea was first proposed in the task of neural language modeling, when Bengio et al. (2003) and Collobert et al. (2011) showed that a neural language model (a neural network that learned to predict the next word from prior words) could just use the next word in running text as its supervision signal, and could be used to learn an embedding representation for each word as part of doing this prediction task.

We’ll see how to do neural networks in the next chapter, but word2vec is a much simpler model than the neural network language model, in two ways. First, word2vec simplifies the task (making it binary classification instead of word prediction). Second, word2vec simplifies the architecture (training a logistic regression classifier instead of a multi-layer neural network with hidden layers that demand more sophisticated training algorithms). The intuition of skip-gram is:

  1. Treat the target word and a neighboring context word as positive examples.
  2. Randomly sample other words in the lexicon to get negative samples
  3. Use logistic regression to train a classifier to distinguish those two cases
  4. Use the regression weights as the embeddings

6.8.1 The classifier

Let’s start by thinking about the classification task, and then turn to how to train. Imagine a sentence like the following, with a target word apricot and assume we’re using a window of ±2 context words:

... lemon, a [tablespoon of apricot jam, a] pinch ...
              c1         c2 t       c3  c4

Our goal is to train a classifier such that, given a tuple $(t, c)$ of a target word $t$ paired with a candidate context word $c$ (for example (apricot, jam), or perhaps (apricot, aardvark)), it will return the probability that $c$ is a real context word (true for jam, false for aardvark):
$$P(+\mid t, c)$$

$$P(-\mid t, c) = 1 - P(+\mid t, c)$$

How does the classifier compute the probability $P$? The intuition of the skip-gram model is to base this probability on similarity: a word is likely to occur near the target if its embedding is similar to the target embedding. How can we compute similarity between embeddings? Recall that two vectors are similar if they have a high dot product (cosine, the most popular similarity metric, is just a normalized dot product). In other words:
$$\textrm{Similarity}(t,c) \approx t \cdot c$$
Of course, the dot product $t \cdot c$ is not a probability, it’s just a number ranging from $-\infty$ to $\infty$. (Recall, for that matter, that cosine isn’t a probability either). To turn the dot product into a probability, we’ll use the logistic or sigmoid function $\sigma(x)$, the fundamental core of logistic regression.
$$P(+\mid t,c) = \frac{1}{1+e^{-t\cdot c}}$$

$$P(-\mid t,c) = 1 - P(+\mid t,c) = \frac{e^{-t\cdot c}}{1+e^{-t\cdot c}}$$

We need to take account of multiple context words in the window. Skip-gram makes the strong but very useful simplifying assumption that all context words are independent, allowing us to just multiply their probabilities:
$$P(+\mid t, c_{1:k}) = \prod_{i=1}^k \frac{1}{1+e^{-t\cdot c_i}}$$

$$\log P(+\mid t, c_{1:k}) = \sum_{i=1}^k \log \frac{1}{1+e^{-t\cdot c_i}}$$
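
A minimal sketch of this classifier probability (mine, with randomly initialized placeholder embeddings; the dimension `d = 50` is an arbitrary choice):

```python
# Sketch: P(+|t, c_1..k) for skip-gram, i.e. the sigmoid of each dot product,
# combined under the independence assumption (sum of log probabilities).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_p_positive(t_vec, context_vecs):
    return sum(np.log(sigmoid(np.dot(t_vec, c))) for c in context_vecs)

rng = np.random.default_rng(0)
d = 50                                     # embedding dimension (placeholder)
t = rng.normal(size=d)                     # stand-in embedding for the target word
contexts = [rng.normal(size=d) for _ in range(4)]
print(log_p_positive(t, contexts))         # log P(+ | t, c_1..4)
```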

In summary, skip-gram trains a probabilistic classifier that, given a test target word $t$ and its context window of $k$ words $c_{1:k}$, assigns a probability based on how similar this context window is to the target word. The probability is based on applying the logistic (sigmoid) function to the dot product of the embeddings of the target word with each context word. We could thus compute this probability if only we had embeddings for each target word and context word in the vocabulary. Let’s now turn to learning these embeddings (which is the real goal of training this classifier in the first place).

6.8.2 Learning skip-gram embeddings

Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word $w$ to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don’t occur nearby.

Let’s start by considering a single piece of the training data, from the sentence above:

... lemon, a [tablespoon of apricot jam, a] pinch ...
              c1         c2 t       c3  c4

This example has a target word t (apricot), and 4 context words in the $L = \pm 2$ window, resulting in 4 positive training instances:

positive examples +
t        c
apricot  tablespoon
apricot  of
apricot  jam
apricot  a

For training a binary classifier we also need negative examples, and in fact skip-gram uses more negative examples than positive examples, the ratio set by a parameter $k$. So for each of these $(t, c)$ training instances we’ll create $k$ negative samples, each consisting of the target $t$ plus a ‘noise word’. A noise word is a random word from the lexicon, constrained not to be the target word $t$. The following table shows the setting where $k = 2$, so we’ll have 2 negative examples in the negative training set for each positive example $(t, c)$.

negative examples -
t        c           t        c
apricot  aardvark    apricot  twelve
apricot  puddle      apricot  hello
apricot  where       apricot  dear
apricot  coaxial     apricot  forever

The noise words are chosen according to their weighted unigram frequency $p_\alpha(w)$, where $\alpha$ is a weight. If we were sampling according to unweighted frequency $p(w)$, it would mean that with unigram probability $p(\textrm{“the”})$ we would choose the word the as a noise word, with unigram probability $p(\textrm{“aardvark”})$ we would choose aardvark, and so on. But in practice it is common to set $\alpha = .75$, i.e. use the weighting $p_{3/4}(w)$:
$$P_\alpha(w) = \frac{\textrm{count}(w)^\alpha}{\sum_{w'} \textrm{count}(w')^\alpha}$$
Setting $\alpha = .75$ gives better performance because it gives rare noise words slightly higher probability: for rare words, $P_\alpha(w) > P(w)$.
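
A small sketch of drawing noise words from this weighted unigram distribution (my own helper; the word counts are invented):

```python
# Sketch: sample k noise words from P_alpha(w), never returning the target itself.
import random

def noise_sampler(word_counts, alpha=0.75):
    words = list(word_counts)
    weights = [word_counts[w] ** alpha for w in words]
    def sample(target, k):
        noise = []
        while len(noise) < k:
            w = random.choices(words, weights=weights, k=1)[0]
            if w != target:
                noise.append(w)
        return noise
    return sample

word_counts = {"the": 1000, "aardvark": 2, "jam": 20, "apricot": 5, "hello": 50}
sample = noise_sampler(word_counts)
print(sample("apricot", k=2))    # two noise words for one positive (apricot, c) pair
```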

Given the set of positive and negative training instances, and an initial set of embeddings, the goal of the learning algorithm is to adjust those embeddings such that we

  • Maximize the similarity of the target word, context word pairs $(t, c)$ drawn from the positive examples
  • Minimize the similarity of the $(t, c)$ pairs drawn from the negative examples.

We can express this formally over the whole training set as:
$$L(\theta) = \sum_{(t,c)\in +} \log P(+\mid t,c) + \sum_{(t,c)\in -} \log P(-\mid t,c)$$
Or, focusing in on one word/context pair $(t, c)$ with its $k$ noise words $n_1, \ldots, n_k$, the learning objective $L$ is:
$$\begin{aligned} L(\theta) &= \log P(+\mid t,c) + \sum_{i=1}^k \log P(-\mid t, n_i)\\ &= \log \sigma(c \cdot t) + \sum_{i=1}^k \log \sigma(-n_i \cdot t)\end{aligned}$$
We can then use stochastic gradient descent to train to this objective, iteratively modifying the parameters (the embeddings for each target word t t t and each context word or noise word c c c in the vocabulary) to maximize the objective.
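
Here is one way such an update could look in numpy, as a sketch only: I derived the gradients by differentiating the per-pair objective above (this is not the word2vec source); `T`, `C`, the indices, and the learning rate are placeholders.

```python
# Sketch: one stochastic gradient step that increases L(theta) for a single
# (target, context) pair plus k noise words. T and C hold target and context
# embeddings as rows of shape (|V|, d).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(T, C, t_id, c_id, noise_ids, lr=0.05):
    t = T[t_id]
    g_pos = 1.0 - sigmoid(np.dot(C[c_id], t))   # how far P(+|t,c) is from 1
    grad_t = g_pos * C[c_id]
    C[c_id] += lr * g_pos * t                   # pull the true context toward t
    for n_id in noise_ids:
        g_neg = sigmoid(np.dot(C[n_id], t))     # how far P(-|t,n) is from 1
        grad_t -= g_neg * C[n_id]
        C[n_id] -= lr * g_neg * t               # push each noise word away from t
    T[t_id] += lr * grad_t                      # move the target embedding

rng = np.random.default_rng(0)
V, d = 100, 50
T = rng.normal(scale=0.1, size=(V, d))
C = rng.normal(scale=0.1, size=(V, d))
sgns_step(T, C, t_id=3, c_id=7, noise_ids=[11, 42])
```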

Note that the skip-gram model thus actually learns two separate embeddings for each word $w$: the target embedding $t$ and the context embedding $c$. These embeddings are stored in two matrices, the target matrix $T$ and the context matrix $C$. So each row $i$ of the target matrix $T$ is the $1 \times d$ vector embedding $t_i$ for word $i$ in the vocabulary $V$, and each column $i$ of the context matrix $C$ is a $d \times 1$ vector embedding $c_i$ for word $i$ in $V$. Here $d$ is the dimension of an embedding and is far less than $|V|$.

Just as in logistic regression, then, the learning algorithm starts with randomly initialized $T$ and $C$ matrices, and then walks through the training corpus using gradient descent to move $T$ and $C$ so as to maximize the objective $L(\theta)$. Thus the matrices $T$ and $C$ function as the parameters $\theta$ that logistic regression is tuning.

Once the embeddings are learned, we’ll have two embeddings for each word $w_i$: $t_i$ and $c_i$. We can choose to throw away the $C$ matrix and just keep $T$, in which case each word $w_i$ will be represented by the vector $t_i$.

Alternatively we can add the two embeddings together, using the summed embedding $t_i + c_i$ as the new $d$-dimensional embedding, or we can concatenate them into an embedding of dimensionality $2d$.

As with the simple count-based methods like tf-idf, the context window size $L$ affects the performance of skip-gram embeddings, and experiments often tune the parameter $L$ on a dev set. One difference from the count-based methods is that for skip-grams, the larger the window size the more computation the algorithm requires for training (more neighboring words must be predicted).

6.9 Visualizing Embeddings

The simplest way to visualize the meaning of a word $w$ embedded in a space is to list the most similar words to $w$, sorting all words in the vocabulary by their cosine with $w$.

Yet another visualization method is to use a clustering algorithm to show a hierarchical representation of which words are similar to others in the embedding space.

Probably the most common visualization method, however, is to project the 100 dimensions of a word down into 2 dimensions, using a projection method called t-SNE (van der Maaten and Hinton, 2008).
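
As a sketch of that projection step (assuming scikit-learn is installed; the “embeddings” here are just random placeholders rather than trained vectors):

```python
# Sketch: project 100-dimensional vectors down to 2 dimensions with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 100))      # stand-in for 200 word embeddings
coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(vectors)
print(coords.shape)                        # (200, 2): one 2-d point per word, ready to plot
```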

6.10 Semantic properties of embeddings

Vector semantic models have a number of parameters. One parameter that is relevant to both sparse tf-idf vectors and dense word2vec vectors is the size of the context window used to collect counts. This is generally between 1 and 10 words on each side of the target word (for a total context of 2-20 words).

The choice depends on the goals of the representation. Shorter context windows tend to lead to representations that are a bit more syntactic, since the information is coming from immediately nearby words. When the vectors are computed from short context windows, the most similar words to a target word $w$ tend to be semantically similar words with the same parts of speech. When vectors are computed from long context windows, the highest cosine words to a target word $w$ tend to be words that are topically related but not similar.

It’s also often useful to distinguish two kinds of similarity or association between words (Schutze and Pedersen, 1993). Two words have first-order co-occurrence (sometimes called syntagmatic association) if they are typically nearby each other. Thus wrote is a first-order associate of book or poem. Two words have second-order co-occurrence (sometimes called paradigmatic association) if they have similar neighbors. Thus wrote is a second-order associate of words like said or remarked.

Analogy: Another semantic property of embeddings is their ability to capture relational meanings. Mikolov et al. (2013b) and Levy and Goldberg (2014b) show that the offsets between vector embeddings can capture some analogical relations between words. For example, the result of the expression vector(‘king’) - vector(‘man’) + vector(‘woman’) is a vector close to vector(‘queen’). Similarly, they found that the expression vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’).
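
A minimal sketch of this vector-offset method (mine; the tiny embedding table is constructed by hand so that the king/queen offset works out, just to show the mechanics):

```python
# Sketch: solve "a is to b as c is to ?" by finding the word whose embedding
# has the highest cosine with vector(b) - vector(a) + vector(c).
import numpy as np

def analogy(embeddings, a, b, c):
    target = embeddings[b] - embeddings[a] + embeddings[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, vec in embeddings.items():
        if word in (a, b, c):               # exclude the three input words
            continue
        sim = np.dot(vec, target) / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=10) for w in ["king", "man", "woman", "apple", "rome"]}
embeddings["queen"] = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(analogy(embeddings, "man", "king", "woman"))   # 'queen'
```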

Embeddings and Historical Semantics: Embeddings can also be a useful tool for studying how meaning changes over time, by computing multiple embedding spaces, each from texts written in a particular time period.

6.11 Bias and Embeddings

6.12 Evaluating Vector Models

For intrinsic evaluations, the most common metric is to test their performance on similarity, computing the correlation between an algorithm’s word similarity scores and word similarity ratings assigned by humans. WordSim-353 (Finkelstein et al., 2002) is a commonly used set of ratings from 0 to 10 for 353 noun pairs; for example (plane, car) had an average score of 5.77. SimLex-999 (Hill et al., 2015) is a more difficult dataset that quantifies similarity (cup, mug) rather than relatedness (cup, coffee), and includes both concrete and abstract adjective, noun, and verb pairs. The TOEFL dataset is a set of 80 questions, each consisting of a target word with 4 additional word choices; the task is to choose which is the correct synonym, as in the example: Levied is closest in meaning to: imposed, believed, requested, correlated (Landauer and Dumais, 1997). All of these datasets present words without context.

Slightly more realistic are intrinsic similarity tasks that include context. The Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012) offers a richer evaluation scenario, giving human judgments on 2,003 pairs of words in their sentential context, including nouns, verbs, and adjectives. This dataset enables the evaluation of word similarity algorithms that can make use of context words. The semantic textual similarity task (Agirre et al. 2012, Agirre et al. 2015) evaluates the performance of sentence-level similarity algorithms, consisting of a set of pairs of sentences, each pair with human-labeled similarity scores.

Another task used for evaluation is an analogy task, where the system has to solve problems of the form a is to b as c is to d, given a, b, and c and having to find d. Thus given Athens is to Greece as Oslo is to ___, the system must fill in the word Norway. Or, for a more syntactically oriented example: given mouse, mice, and dollar the system must return dollars. Large sets of such tuples have been created (Mikolov et al. 2013, Mikolov et al. 2013b).

6.13 Summary

  • In vector semantics, a word is modeled as a vector—a point in high-dimensional space, also called an embedding.
  • Vector semantic models fall into two classes: sparse and dense. In sparse models like tf-idf, each dimension corresponds to a word in the vocabulary $V$.
  • Cells in sparse models are functions of co-occurrence counts. The term-document matrix has a row for each word (term) in the vocabulary and a column for each document.
  • The word-context matrix has a row for each (target) word in the vocabulary and a column for each context term in the vocabulary.
  • A common sparse weighting is tf-idf, which weights each cell by its term frequency and inverse document frequency.
  • Word and document similarity is computed by taking the dot product between vectors. The cosine of two vectors—a normalized dot product—is the most popular such metric.
  • PPMI (positive pointwise mutual information) is an alternative weighting scheme to tf-idf.
  • Dense vector models have dimensionality 50-300 and the dimensions are harder to interpret.
  • The word2vec family of models, including skip-gram and CBOW, is a popular efficient way to compute dense embeddings.
  • Skip-gram trains a logistic regression classifier to compute the probability that two words are ‘likely to occur nearby in text’. This probability is computed from the dot product between the embeddings for the two words.
  • Skip-gram uses stochastic gradient descent to train the classifier, by learning embeddings that have a high dot-product with embeddings of words that occur nearby and a low dot-product with noise words.
  • Other important embedding algorithms include GloVe, a method based on ratios of word co-occurrence probabilities, and fasttext, an open-source library for computing word embeddings by summing embeddings of the bag of character n-grams that make up a word.
