Word embedding techniques are methods used to represent words in a numerical format, such as a vector, that can be input into a machine learning model. These embeddings capture the meaning of the words and the relationships between them in a continuous, dense, and low-dimensional vector space.
Some of the most popular word embedding techniques include:
Word2Vec is a technique that uses a shallow neural network to learn the embeddings of words. It is based on the idea that words that occur in similar contexts have similar meanings.
There are two main architectures for Word2Vec: the Continuous Bag of Words (CBOW) model and the Skip-Gram model.
CBOW: The CBOW model tries to predict a target word (the center word) based on the context words (the surrounding words) in a given window size. The input to the model is the one-hot encoded representation of the context words and the output is the one-hot encoded representation of the target word. The learned weights of the input layer are used as the embedding for each context word.
Skip-Gram: The Skip-Gram model is the opposite of the CBOW model. It tries to predict the context words based on the target word. The input to the model is the one-hot encoded representation of the target word and the output is the one-hot encoded representation of the context words. The learned weights between the input and hidden layers are used as the embedding for the target word.
For example, let's say we have the sentence: "The cat sat on the mat."
Using the CBOW model with a window size of 2, we would try to predict the center word "sat" based on the context words "The" and "cat" on one side and "on" and "the" on the other side.
Using the Skip-Gram model with a window size of 2, we would try to predict the context words "The", "cat", "on", and "the" based on the target word "sat", as the sketch below illustrates.
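To make the two setups concrete, here is a minimal sketch in plain Python that prints the (context, target) pairs each architecture is trained on for this sentence; the helper name `training_pairs` is purely illustrative.

```python
def training_pairs(tokens, window=2):
    """Yield (context_words, center_word) pairs for every position in the sentence."""
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        yield context, center

tokens = "the cat sat on the mat".split()

for context, center in training_pairs(tokens, window=2):
    # CBOW:      input = the context words, output = the center word
    # Skip-Gram: input = the center word,   output = each context word in turn
    print(f"CBOW      {context} -> {center}")
    print(f"Skip-Gram {center} -> {context}")
```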
In both cases, the embeddings learned by the model will represent the meaning of each word in the sentence in a numerical format, and these embeddings can be used as input for other machine learning models or for further analysis.
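In practice these models are rarely written from scratch. As a minimal sketch, assuming the gensim library (version 4.x, where the dimensionality parameter is called `vector_size`) and a toy two-sentence corpus, Word2Vec can be trained in either mode:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more text).
sentences = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
]

# sg=0 selects CBOW, sg=1 selects Skip-Gram; window=2 matches the example above.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)
skip_gram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(cbow.wv["cat"][:5])                # first 5 dimensions of the "cat" vector
print(skip_gram.wv.most_similar("cat"))  # nearest neighbours in the (toy) vector space
```

On a corpus this small the resulting vectors are of course meaningless; the snippet only shows how the two architectures are selected.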
GloVe (Global Vectors for Word Representation) is a word embedding technique that builds a global word-word co-occurrence matrix from the corpus and then learns word vectors whose dot products approximate the co-occurrence statistics, which amounts to factorizing that matrix.
Here's an example of how GloVe works:
First, a co-occurrence matrix is created by counting the number of times each word appears in a given context (e.g., within a certain window size of another word). This matrix is a symmetric matrix where the rows and columns represent words and the entries represent the number of times the corresponding words co-occur.
Next, embeddings are learned from this matrix. Rather than an explicit factorization such as singular value decomposition (SVD), GloVe fits a word vector and a context vector for every word with a weighted least-squares objective, so that the dot product of two vectors (plus bias terms) approximates the logarithm of the corresponding co-occurrence count. The resulting vectors represent the words in a continuous, dense, and low-dimensional vector space.
For example, let's say we have a corpus of text containing the two sentences "The cat sat on the mat." and "The dog barked at the cat."
We create a co-occurrence matrix with the (lowercased) words as rows and columns and count the number of times each pair of words appears within a given context window (here, a window size of 2).
The matrix would look like this:
|        | the | cat | sat | on | mat | dog | barked | at |
|--------|-----|-----|-----|----|-----|-----|--------|----|
| the    | 0   | 2   | 2   | 1  | 1   | 1   | 2      | 1  |
| cat    | 2   | 0   | 1   | 1  | 0   | 0   | 0      | 1  |
| sat    | 2   | 1   | 0   | 1  | 0   | 0   | 0      | 0  |
| on     | 1   | 1   | 1   | 0  | 1   | 0   | 0      | 0  |
| mat    | 1   | 0   | 0   | 1  | 0   | 0   | 0      | 0  |
| dog    | 1   | 0   | 0   | 0  | 0   | 0   | 1      | 1  |
| barked | 2   | 0   | 0   | 0  | 0   | 1   | 0      | 1  |
| at     | 1   | 1   | 0   | 0  | 0   | 1   | 1      | 0  |
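The counts in this table can be reproduced with a few lines of plain Python; this sketch lowercases the words, uses a window of 2 tokens on either side, and does not let windows cross sentence boundaries:

```python
from collections import Counter
from itertools import combinations

sentences = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
]
window = 2

# Count how often each (unordered) pair of words occurs within the window.
pair_counts = Counter()
for tokens in sentences:
    for i, j in combinations(range(len(tokens)), 2):
        if j - i <= window:
            pair_counts[tuple(sorted((tokens[i], tokens[j])))] += 1

# Print the symmetric co-occurrence matrix row by row.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "barked", "at"]
for w1 in vocab:
    row = [pair_counts[tuple(sorted((w1, w2)))] if w1 != w2 else 0 for w2 in vocab]
    print(f"{w1:>7}", row)
```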
Then, we fit word and context vectors to this matrix with GloVe's weighted least-squares objective, so that the dot product of the vectors for two words approximates the logarithm of their co-occurrence count. The resulting embeddings represent the words in a continuous, dense, and low-dimensional vector space.
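Here is a toy NumPy sketch of that objective, run on the matrix above. The hyperparameters (vector dimension, learning rate, and the `x_max` and `alpha` of GloVe's weighting function) are arbitrary illustrative choices, and a corpus this small cannot yield meaningful vectors; the point is only to show the weighted least-squares fit to the log counts:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog", "barked", "at"]
X = np.array([          # the co-occurrence matrix from the table above
    [0, 2, 2, 1, 1, 1, 2, 1],
    [2, 0, 1, 1, 0, 0, 0, 1],
    [2, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 1],
    [2, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0, 1, 1, 0],
], dtype=float)

rng = np.random.default_rng(0)
dim, lr, x_max, alpha = 5, 0.05, 100.0, 0.75
W  = rng.normal(scale=0.1, size=(len(vocab), dim))   # word vectors
C  = rng.normal(scale=0.1, size=(len(vocab), dim))   # context vectors
bw = np.zeros(len(vocab))                            # word biases
bc = np.zeros(len(vocab))                            # context biases

pairs = [(i, j) for i in range(len(vocab)) for j in range(len(vocab)) if X[i, j] > 0]

for epoch in range(500):
    for i, j in pairs:
        f = min(1.0, (X[i, j] / x_max) ** alpha)              # GloVe weighting f(X_ij)
        err = W[i] @ C[j] + bw[i] + bc[j] - np.log(X[i, j])   # fit error on the log count
        wi, cj = W[i].copy(), C[j].copy()
        W[i]  -= lr * f * err * cj
        C[j]  -= lr * f * err * wi
        bw[i] -= lr * f * err
        bc[j] -= lr * f * err

embeddings = W + C    # GloVe typically uses the sum of word and context vectors
print(embeddings[vocab.index("cat")])
```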
GloVe is a powerful technique that produces high-quality word embeddings and has been used in many natural language processing (NLP) tasks such as machine translation, text classification, and named entity recognition.
FastText is a word embedding technique that is an extension of the Skip-Gram model in Word2Vec, but it takes into account subwords, or character n-grams, to obtain embeddings.
Here's an example of how FastText works:
First, the text corpus is tokenized into individual words, and a set of character n-grams is extracted from each word, with the special boundary symbols "<" and ">" added at the start and end of the word. For example, with n-grams of size 3 the word "cat" (padded to "<cat>") is represented by the n-grams ["<ca", "cat", "at>"], plus the whole word "<cat>" itself.
Next, a Skip-Gram model is trained much as in Word2Vec, except that a word is represented by its character n-grams: each n-gram gets its own vector, and these vectors are learned so that the word they compose predicts its context words well.
Finally, the embeddings of a word's character n-grams are summed to obtain the embedding for the word itself, as sketched below.
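A minimal sketch of the n-gram extraction and the summation step, with randomly initialized vectors standing in for trained n-gram embeddings:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=3):
    """Character n-grams of a word, with < and > marking the word boundaries."""
    padded = f"<{word}>"
    grams = [padded[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(padded) - n + 1)]
    grams.append(padded)      # FastText also keeps the whole (padded) word as a unit
    return grams

print(char_ngrams("cat"))     # ['<ca', 'cat', 'at>', '<cat>']

# Toy composition of a word vector from its n-gram vectors (random stand-ins here):
rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=8) for g in char_ngrams("cat")}
cat_vector = np.sum(list(ngram_vectors.values()), axis=0)   # sum over the n-gram vectors
print(cat_vector)
```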
For example, let's say we have the word "cat" and we want to obtain its embedding using FastText with n-grams of size 3.
First, we extract the character n-grams ["<ca", "cat", "at>"], plus the whole word "<cat>".
Next, we train a skip-gram model on these character n-grams and obtain the embeddings for each n-gram.
Finally, we sum the embeddings of the n-grams ["<ca", "cat", "at>", "<cat>"] to obtain the embedding for the original word "cat".
Because FastText uses subword information, it can handle rare and out-of-vocabulary words (words that were not seen in the training set): their vectors can still be composed from their character n-grams. This property makes it particularly useful for tasks such as text classification and language identification.
FastText is also efficient: it can learn word embeddings from very large text corpora quickly, and its authors report training on more than a billion words in a matter of minutes on a standard multicore CPU.
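For practical use, gensim also ships a FastText implementation. A minimal sketch, assuming gensim 4.x and the same toy corpus, showing that an out-of-vocabulary word still receives a vector composed from its character n-grams:

```python
from gensim.models import FastText

sentences = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
]

# min_n / max_n control the range of character n-gram lengths.
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=5, epochs=100)

print(model.wv["cat"][:5])    # vector for an in-vocabulary word
print(model.wv["cats"][:5])   # "cats" was never seen, but shares n-grams with "cat"
```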
ELMo (Embeddings from Language Models) is a deep contextualized word representation technique that models both complex characteristics of word use (e.g., syntax and semantics) and how these uses vary across linguistic contexts (i.e., it models polysemy).
ELMo uses a deep bidirectional language model (biLM), which is trained on a large corpus of text data. The biLM takes a sequence of words as input and produces, for each word, a fixed-dimensional representation that depends on the entire sentence; these are the ELMo embeddings. The embeddings are then used as input features for specific NLP tasks such as named entity recognition, question answering, and sentiment analysis.
Here's an example of how ELMo works:
First, the text corpus is tokenized into individual words and a deep bidirectional language model (biLM) is trained on it.
Next, the biLM is used to obtain the embeddings for each word in a given sentence. An ELMo embedding is a weighted combination of the biLM's layer outputs: the character-based input layer gives a context-independent token representation, while the biLSTM layers above it capture syntax and context-dependent meaning; the combination weights are learned for the downstream task.
Finally, these embeddings are fed into a task-specific model for the NLP task at hand (the layer-combination weights, and optionally the biLM itself, can be tuned for that task).
For example, let's say we have the sentence "The cat sat on the mat." and we want to obtain the ELMo embeddings for each word in the sentence.
First, we feed the sentence to the pre-trained biLM and obtain the embeddings for each word in the sentence.
Next, we feed these embeddings into a model for a specific NLP task, for example named entity recognition.
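The first step, obtaining contextual embeddings from a pretrained biLM, can be sketched with the AllenNLP library's ELMo module. The options/weights paths below are placeholders for the pretrained files published in the AllenNLP documentation, and the 1024-dimensional output assumes the standard pretrained model:

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

OPTIONS_FILE = "path/to/elmo_options.json"   # placeholder: pretrained biLM config
WEIGHTS_FILE = "path/to/elmo_weights.hdf5"   # placeholder: pretrained biLM weights

# num_output_representations=1 requests a single learned mixture of the biLM layers.
elmo = Elmo(OPTIONS_FILE, WEIGHTS_FILE, num_output_representations=1, dropout=0.0)

sentences = [["The", "cat", "sat", "on", "the", "mat", "."]]
character_ids = batch_to_ids(sentences)        # shape: (batch, tokens, chars)

output = elmo(character_ids)
embeddings = output["elmo_representations"][0] # shape: (batch, tokens, 1024)
print(embeddings.shape)
```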
ELMo is trained on a large corpus of text data, so it captures a wide range of linguistic features and is able to handle polysemous words, that is, words that have multiple meanings depending on the context. This property makes ELMo particularly useful for tasks such as named entity recognition and question answering, where the meaning of a word can change depending on the context.