We want to represent a word with a vector in NLP. There are many methods.
Represent every word as an ℝ^{|V|×1} vector with all 0s and a single 1 at the index of that word in the sorted English vocabulary, where V is the vocabulary (the set of all words). This is the one-hot representation.
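As a rough sketch (the toy vocabulary below is an assumption, not from the text), building such a one-hot vector might look like this:

```python
import numpy as np

# Assumed toy vocabulary, sorted alphabetically.
vocab = sorted(["cat", "jumped", "over", "puddle", "the"])
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the |V| x 1 column vector with a single 1 at the word's index."""
    vec = np.zeros((len(vocab), 1))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat").ravel())  # [1. 0. 0. 0. 0.]
```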
Representing a word by means of its neighbors.
In this method we count the number of times each word appears inside a window of a particular size around the word of interest; collecting these counts over a corpus gives a |V| × |V| co-occurrence matrix.
For example:
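A minimal sketch of building such a window co-occurrence matrix, assuming a tiny tokenized corpus and a window of size 1 (both are illustrative assumptions):

```python
import numpy as np

# Assumed toy corpus; each sentence is already tokenized and lowercased.
corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1  # count neighbors within 1 position of the word of interest

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(vocab), len(vocab)))

for sent in corpus:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                X[idx[word], idx[sent[j]]] += 1

print(vocab)
print(X)  # row i holds the neighbor counts for vocab[i]
```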
This matrix is very large and its size grows with the vocabulary. We can make it smaller with SVD (singular value decomposition), keeping only the top k singular values.
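A sketch of this reduction with a truncated SVD, assuming X is the co-occurrence matrix built above and k is a chosen number of dimensions:

```python
import numpy as np

def svd_embeddings(X, k):
    """Reduce a |V| x |V| co-occurrence matrix to |V| x k word vectors."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the first k singular values/vectors;
    # each row of the result is a k-dimensional word vector.
    return U[:, :k] * S[:k]

# embeddings = svd_embeddings(X, k=2)
```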
SVD-based methods do not scale well to big matrices, and it is hard to incorporate new words or documents. The computational cost for an m × n matrix is O(mn²).
Instead, we need a model that assigns a probability to a sequence of tokens.
For example:
* The cat jumped over the puddle. —high probability
* Stock boil fish is toy. —low probability
Unigrams:
We can take the unigram language model approach and break apart this probability by assuming the word occurrences are completely independent:

P(w_1, w_2, …, w_n) = ∏_{i=1}^{n} P(w_i)
However, we know the next word is highly contingent upon the previous sequence of words, so this model performs poorly.
Bigrams:
We let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it:

P(w_1, w_2, …, w_n) = ∏_{i=2}^{n} P(w_i | w_{i−1})
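A minimal sketch of estimating these unigram and bigram probabilities from counts, assuming a toy tokenized corpus (the data below is illustrative):

```python
from collections import Counter

# Assumed toy corpus of tokenized sentences.
corpus = [["the", "cat", "jumped", "over", "the", "puddle"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
total = sum(unigrams.values())

def p_unigram(w):
    """P(w): relative frequency of the word."""
    return unigrams[w] / total

def p_bigram(w_prev, w):
    """P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_unigram("the"))        # 2/6
print(p_bigram("the", "cat"))  # 1/2
```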
Example Sequence:
“The cat jumped over the puddle.”
What is the Continuous Bag of Words (CBOW) model?
We treat {“the”, “cat”, “over”, “puddle”} as the context, and the word “jumped” as the center word. The context should be able to predict the center word. We call this type of model a Continuous Bag of Words (CBOW) model.
Known parameters:
If the index of the center word is c, then the indices of the context words are c−m, …, c−1, c+1, …, c+m, where m is the window size.
The inputs of the model are the one-hot vectors of the context words, which we write as x^(c−m), …, x^(c−1), x^(c+1), …, x^(c+m).
The output is the one-hot vector of the center word, which we write as y^(c).
Parameters we need to learn:
𝒱 ∈ ℝ^{n×|V|}: the input word matrix
v_i: the i-th column of 𝒱, the input vector representation of word w_i
𝒰 ∈ ℝ^{|V|×n}: the output word matrix
u_i: the i-th row of 𝒰, the output vector representation of word w_i
Here n is an arbitrary size that defines the dimensionality of our embedding space.
How does it work:
1. We get our embedded word vectors for the context: v_{c−m} = 𝒱x^(c−m), …, v_{c−1} = 𝒱x^(c−1), v_{c+1} = 𝒱x^(c+1), …, v_{c+m} = 𝒱x^(c+m).
2. We average these 2m vectors into a single context vector v̂ = (v_{c−m} + ⋯ + v_{c−1} + v_{c+1} + ⋯ + v_{c+m}) / 2m.
3. We generate a score vector z = 𝒰v̂ ∈ ℝ^{|V|}.
4. We turn the scores into probabilities ŷ = softmax(z), and we want ŷ to match y^(c), the one-hot vector of the true center word.
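A minimal numpy sketch of this forward pass; V_in and U_out stand for 𝒱 and 𝒰, and the vocabulary size, embedding size, and context indices are illustrative assumptions:

```python
import numpy as np

vocab_size, n = 10, 4  # |V| and embedding size n (assumed for illustration)
rng = np.random.default_rng(0)
V_in = rng.normal(size=(n, vocab_size))   # input word matrix, 𝒱 ∈ ℝ^{n×|V|}
U_out = rng.normal(size=(vocab_size, n))  # output word matrix, 𝒰 ∈ ℝ^{|V|×n}

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cbow_forward(context_indices):
    # 1. Look up the input vectors of the context words (columns of V_in).
    context_vectors = V_in[:, context_indices]   # shape (n, 2m)
    # 2. Average them into a single context vector v_hat.
    v_hat = context_vectors.mean(axis=1)          # shape (n,)
    # 3. Score every word in the vocabulary: z = U_out @ v_hat.
    z = U_out @ v_hat                             # shape (|V|,)
    # 4. Turn the scores into a probability distribution y_hat.
    return softmax(z)

y_hat = cbow_forward([0, 1, 3, 4])  # assumed context indices (center word skipped)
print(y_hat.sum())                   # ~1.0
```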
How to learn 𝒱 and 𝒰:
We learn them with stochastic gradient descent, so we need an objective (loss) function.
We use cross-entropy to measure the distance between the predicted distribution ŷ and the true distribution y^(c):

H(ŷ, y) = −∑_{j=1}^{|V|} y_j log(ŷ_j)

Since y^(c) is one-hot, this reduces to −log(ŷ_c), the negative log probability assigned to the true center word.
We use stochastic gradient descent to update 𝒱 and 𝒰.
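A sketch of one such update, continuing the names from the forward-pass sketch above (the learning rate and the gradient bookkeeping come from the standard softmax + cross-entropy derivation, not from the text):

```python
import numpy as np

def cbow_sgd_step(V_in, U_out, context_indices, center_index, lr=0.05):
    """One SGD update for a single (context, center word) training pair."""
    v_hat = V_in[:, context_indices].mean(axis=1)  # averaged context vector
    z = U_out @ v_hat
    z = z - z.max()
    y_hat = np.exp(z) / np.exp(z).sum()            # predicted distribution

    loss = -np.log(y_hat[center_index])            # cross-entropy with one-hot y

    # Gradient of the loss through the softmax: dL/dz = y_hat - y.
    dz = y_hat.copy()
    dz[center_index] -= 1.0
    grad_U = np.outer(dz, v_hat)                   # dL/dU_out
    grad_v_hat = U_out.T @ dz                      # dL/dv_hat

    # Update the output matrix and the input vectors of the context words.
    U_out -= lr * grad_U
    for i in context_indices:
        V_in[:, i] -= lr * grad_v_hat / len(context_indices)
    return loss
```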