Word Vectors(1)

In NLP we want to represent each word as a vector. There are several ways to do this.

1 One-hot Vector

Represent every word as an $\mathbb{R}^{|V| \times 1}$ vector with all 0s and a single 1 at the index of that word in the sorted English vocabulary, where $V$ is the vocabulary (the set of all words).
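
A minimal sketch of this encoding in Python with NumPy. The toy vocabulary and the helper name `one_hot` are assumptions made for illustration, not part of the original notes:

```python
import numpy as np

# Hypothetical toy vocabulary, sorted alphabetically (an assumption for illustration).
vocab = sorted(["cat", "jumped", "over", "puddle", "the"])
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector for `word`."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

cat_vec = one_hot("cat")
the_vec = one_hot("the")

# Two different one-hot vectors are orthogonal, so their dot product is 0.
similarity = cat_vec @ the_vec
```

Because any two distinct one-hot vectors have a dot product of 0, this representation carries no notion of similarity between words.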

2 SVD Based Methods

2.1 Window based Co-occurrence Matrix

This method represents a word by means of its neighbors.
We count the number of times each word appears inside a window of a particular size around the word of interest.

For example:
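
As a minimal sketch of the counting step, the following Python snippet builds the matrix for a hypothetical two-sentence corpus with a window of size 1. The corpus, the window size, and the variable names are assumptions for illustration, not taken from the original example:

```python
import numpy as np

# Hypothetical toy corpus and window size, chosen only for illustration.
corpus = [
    "the cat jumped over the puddle".split(),
    "the dog jumped over the cat".split(),
]
window = 1

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word j appears within `window` positions of word i.
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for pos, word in enumerate(sent):
        lo = max(0, pos - window)
        hi = min(len(sent), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[idx[word], idx[sent[ctx_pos]]] += 1
```

With a symmetric window the resulting matrix is symmetric, and its size grows as the square of the vocabulary size.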

The resulting matrix is very large, so we reduce its dimensionality with SVD:

  1. Generate the $|V| \times |V|$ co-occurrence matrix $X$.
  2. Apply SVD on $X$ to get $X = USV^T$.
  3. Select the first $k$ columns of $U$ to get $k$-dimensional word vectors.
  4. The ratio $\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$ indicates the amount of variance captured by the first $k$ dimensions.
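
The four steps can be sketched with NumPy's `np.linalg.svd`. The small matrix below holds made-up counts and `k = 2` is an arbitrary choice, both assumed only for illustration:

```python
import numpy as np

# Hypothetical symmetric co-occurrence matrix for a 4-word vocabulary (made-up counts).
X = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 3],
    [0, 1, 3, 0],
], dtype=float)

# np.linalg.svd returns U, the singular values in descending order, and V^T.
U, s, Vt = np.linalg.svd(X)

k = 2                               # number of dimensions to keep (arbitrary choice)
word_vectors = U[:, :k]             # row i is the k-dimensional vector of word i
captured = s[:k].sum() / s.sum()    # ratio from step 4
```

Each row of `word_vectors` is the embedding of one word, and `captured` is the share of the singular-value mass kept by the first `k` dimensions.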

2.2 Shortcomings

SVD-based methods do not scale well to large matrices, and it is hard to incorporate new words or documents. The computational cost for an $m \times n$ matrix is $O(mn^2)$.

3 Iteration Based Methods - Word2Vec

3.1 Language Models (Unigrams, Bigrams, etc.)

We need a model that assigns a probability to a sequence of tokens.

For example:
* The cat jumped over the puddle. (high probability)
* Stock boil fish is toy. (low probability)

Unigrams:
We can take the unigram language model approach and break this probability apart by assuming that word occurrences are completely independent:

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i)$$

However, we know that the next word is highly contingent upon the previous sequence of words. Because this model ignores word order, it can assign a high probability to a nonsensical sentence.

Bigrams:
We let the probability of the sequence depend on the pairwise probability of each word in the sequence and the word immediately before it:

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$
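
To make the difference between the two models concrete, here is a minimal sketch that estimates both from raw counts. The toy corpus and the simple count-based estimates (count divided by total) are assumptions for illustration; the notes do not specify how the probabilities are estimated:

```python
from collections import Counter

# Hypothetical toy corpus; real models are estimated from far more text.
corpus = [
    "the cat jumped over the puddle".split(),
    "the cat sat".split(),
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent))
)
total_tokens = sum(unigram_counts.values())

def unigram_prob(sentence):
    """Product of P(w_i), each estimated as count(w_i) / total tokens."""
    p = 1.0
    for w in sentence:
        p *= unigram_counts[w] / total_tokens
    return p

def bigram_prob(sentence):
    """Product over i >= 2 of P(w_i | w_{i-1}), estimated as
    count(w_{i-1}, w_i) / count(w_{i-1})."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # unseen history: no estimate available in this sketch
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p

ordered = "the cat jumped".split()
shuffled = "jumped the cat".split()
```

The unigram model gives `ordered` and `shuffled` the same probability because it ignores word order, while the bigram model gives `shuffled` a probability of zero on this corpus since the pair ("jumped", "the") never occurs.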

3.2 Continuous Bag of Words Model (CBOW)

Example Sequence:
The cat jumped over the puddle.

What is the Continuous Bag of Words Model?
We treat {“the”, “cat”, “over”, “puddle”} as the context and the word “jumped” as the center word. The context should be able to predict the center word. We call this type of model a Continuous Bag of Words Model.

Known parameters:
If the index of the center word is $c$, then the indexes of the context words are $c-m, \ldots, c-1, c+1, \ldots, c+m$, where $m$ is the window size.
The input of the model is the one-hot vectors of the context words. We represent them with $x^{(c-m)}, \ldots, x^{(c-1)}, x^{(c+1)}, \ldots, x^{(c+m)}$.
The output is the one-hot vector of the center word. We represent it with $y^{(c)}$.

Parameters we need to learn:
$\mathcal{V} \in \mathbb{R}^{n \times |V|}$: input word matrix
$v_i$: $i$-th column of $\mathcal{V}$, the input vector representation of word $w_i$
$\mathcal{U} \in \mathbb{R}^{|V| \times n}$: output word matrix
$u_i$: $i$-th row of $\mathcal{U}$, the output vector representation of word $w_i$
Here $n$ is an arbitrary size that defines the dimension of our embedding space.

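How does it work:
1. We get our embedded word vectors for the context by multiplying the input word matrix $\mathcal{V}$ with each one-hot vector:

$$v_{c-m} = \mathcal{V} x^{(c-m)}, \; v_{c-m+1} = \mathcal{V} x^{(c-m+1)}, \; \ldots, \; v_{c+m} = \mathcal{V} x^{(c+m)}$$

2. Average these vectors:
$$\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \ldots + v_{c+m}}{2m}$$

3. Generate a score vector $z = \mathcal{U}\hat{v} \in \mathbb{R}^{|V|}$, where $\mathcal{U}$ is the output word matrix. As the dot product of similar vectors is higher, it will push similar words close to each other in order to achieve a high score.
4. Turn the scores into probabilities: $\hat{y} = \mathrm{softmax}(z) \in \mathbb{R}^{|V|}$.
5. We want the generated probabilities $\hat{y}$ to match the true probabilities $y^{(c)}$, the one-hot vector of the center word.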
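
The forward pass described in these steps can be sketched as follows. The vocabulary, the embedding size `n = 3`, the random initialization, and the names `V_in` and `U_out` (standing for the input and output word matrices) are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "jumped", "over", "puddle"]
idx = {w: i for i, w in enumerate(vocab)}
V_size, n = len(vocab), 3  # n is the embedding size (arbitrary choice)

# Randomly initialized parameters (hypothetical values, not trained).
V_in = rng.normal(scale=0.1, size=(n, V_size))   # input word matrix, n x |V|
U_out = rng.normal(scale=0.1, size=(V_size, n))  # output word matrix, |V| x n

def one_hot(i):
    x = np.zeros(V_size)
    x[i] = 1.0
    return x

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

context = ["the", "cat", "over", "puddle"]

# Step 1: embed each context word (V_in @ x selects one column of V_in).
embedded = [V_in @ one_hot(idx[w]) for w in context]
# Step 2: average the context vectors.
v_hat = np.mean(embedded, axis=0)
# Step 3: compute one score per vocabulary word.
z = U_out @ v_hat
# Step 4: turn the scores into probabilities.
y_hat = softmax(z)
```

Multiplying by a one-hot vector only selects a column, so a practical implementation would index `V_in[:, i]` directly instead of performing the matrix product.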

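How to learn $\mathcal{V}$ and $\mathcal{U}$:
We learn the input and output word matrices with stochastic gradient descent, so we need a loss function.
We use cross-entropy to measure the distance between the two distributions:

$$H(\hat{y}, y) = -\sum_{i=1}^{|V|} y_i \log(\hat{y}_i)$$

Since $y$ is a one-hot vector, with its single 1 at the index $c$ of the center word, this simplifies to:
$$H(\hat{y}, y) = -y_c \log(\hat{y}_c) = -\log(\hat{y}_c)$$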
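
A quick numerical check of this simplification, using a made-up predicted distribution over a hypothetical 4-word vocabulary:

```python
import numpy as np

def cross_entropy(y_hat, y):
    """H(y_hat, y) = -sum_i y_i * log(y_hat_i)."""
    return -np.sum(y * np.log(y_hat))

# Hypothetical predicted distribution over a 4-word vocabulary.
y_hat = np.array([0.1, 0.7, 0.15, 0.05])
# One-hot true distribution: the center word has index 1.
y = np.array([0.0, 1.0, 0.0, 0.0])

full = cross_entropy(y_hat, y)    # sum over the whole vocabulary
simplified = -np.log(y_hat[1])    # only the center word's term survives
```

Both values agree, and the loss approaches 0 as the probability assigned to the center word approaches 1.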

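We formulate our optimization objective as:
$$\begin{aligned}
\text{minimize } J &= -\log P(w_c \mid w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}) \\
&= -\log P(u_c \mid \hat{v}) \\
&= -\log \frac{\exp(u_c^T \hat{v})}{\sum_{j=1}^{|V|} \exp(u_j^T \hat{v})} \\
&= -u_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^T \hat{v})
\end{aligned}$$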

We use stochastic gradient descent to update the input word matrix $\mathcal{V}$ and the output word matrix $\mathcal{U}$.
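
A minimal sketch of such an update loop on a single training example. The gradients are derived here from the objective $J$ and are not stated in the original notes; the word indices, learning rate, step count, and the names `V_in` and `U_out` are likewise assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n = 5, 3
V_in = rng.normal(scale=0.1, size=(n, V_size))   # input word matrix, n x |V|
U_out = rng.normal(scale=0.1, size=(V_size, n))  # output word matrix, |V| x n

context_ids = [0, 1, 3, 4]  # hypothetical context word indices
center_id = 2               # hypothetical center word index
learning_rate = 0.5         # hypothetical step size

def loss_and_grads(V_in, U_out):
    v_hat = V_in[:, context_ids].mean(axis=1)
    z = U_out @ v_hat
    # J = -u_c^T v_hat + log sum_j exp(u_j^T v_hat), computed stably.
    z_max = z.max()
    log_sum = z_max + np.log(np.exp(z - z_max).sum())
    loss = -z[center_id] + log_sum
    y_hat = np.exp(z - log_sum)       # softmax(z)
    delta = y_hat.copy()
    delta[center_id] -= 1.0           # dJ/dz = y_hat - y
    grad_U = np.outer(delta, v_hat)   # dJ/dU
    grad_v_hat = U_out.T @ delta      # dJ/dv_hat
    grad_V = np.zeros_like(V_in)
    for i in context_ids:
        # Each context column contributes 1/(2m) of v_hat.
        grad_V[:, i] += grad_v_hat / len(context_ids)
    return loss, grad_V, grad_U

losses = []
for _ in range(50):
    loss, grad_V, grad_U = loss_and_grads(V_in, U_out)
    losses.append(loss)
    V_in -= learning_rate * grad_V
    U_out -= learning_rate * grad_U
```

In real training each step would use a different (context, center) pair sampled from the corpus, rather than repeating one example as this sketch does.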
