The following are excerpts from the paper Skip-Thought Vectors by Ryan Kiros et al. from the University of Toronto. These excerpts summarize the main idea of the paper; the detailed mathematical formulas are omitted.
Paper name: Skip-Thought Vectors
Paper published time: 2015
Paper authors: Ryan Kiros et al.
Key words: distributed representation, recurrent neural networks, sentence vector representation, semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification
Overview
In recent years, several approaches have been developed for learning composition operators that map word vectors to sentence vectors, including recursive networks, recurrent networks, convolutional networks, and recursive-convolutional methods, among others. All of these methods produce sentence representations that are passed to a supervised task and depend on a class label in order to backpropagate through the composition weights. Consequently, these methods learn high-quality sentence representations, but ones tuned only for their respective task.
In this paper, the authors consider the following question: is there a task and a corresponding loss that will allow us to learn highly generic sentence representations? The authors' answer is the model proposed in this paper, called skip-thoughts; the vectors it induces are called skip-thought vectors.
The basic idea of the model is that, instead of using a word to predict its surrounding context, we encode a sentence to predict the sentences around it. Thus, any composition operator can be substituted as the sentence encoder; only the objective function changes.
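For reference, the objective the excerpts omit can be summarized as follows: given a sentence triple $(s_{i-1}, s_i, s_{i+1})$, the encoder maps $s_i$ to a vector $\mathbf{h}_i$, and two decoders maximize the log-likelihood of the previous and next sentences conditioned on $\mathbf{h}_i$, summed over all triples in the training corpus:

```latex
\sum_{t} \log P\left(w_{i+1}^{t} \mid w_{i+1}^{<t}, \mathbf{h}_i\right)
  + \sum_{t} \log P\left(w_{i-1}^{t} \mid w_{i-1}^{<t}, \mathbf{h}_i\right)
```

where $w_{i+1}^{t}$ is the $t$-th word of sentence $s_{i+1}$ and $w_{i+1}^{<t}$ denotes the words preceding it.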
Approach
Inducing skip-thought vectors
The paper treated skip-thoughts in the framework of encoder-decoder models. That is, an encoder maps words to a sentence vector and a decoder is used to generate the surrounding sentences.
In the model, the authors used an RNN encoder with GRU activations and an RNN decoder with a conditional GRU. While RNNs were chosen here, any encoder and decoder can be used so long as we can backpropagate through them.
After training, sentences that share semantic and syntactic properties are thus mapped to similar vector representations.
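A minimal PyTorch sketch of this encoder-decoder structure follows. It is a simplification, not the paper's exact model: the paper's conditional GRU injects $\mathbf{h}_i$ into the decoder's gate computations, which is approximated here by passing $\mathbf{h}_i$ as the decoders' initial hidden state. The 620/2400 dimensions follow the paper's uni-skip configuration; the remaining names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SkipThoughts(nn.Module):
    """Sketch of a skip-thoughts encoder with two decoders.

    Simplification: the sentence vector h_i conditions the decoders
    via their initial hidden state rather than via modified GRU gate
    equations as in the paper.
    """

    def __init__(self, vocab_size=20000, emb_dim=620, hid_dim=2400):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.dec_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)  # previous sentence
        self.dec_next = nn.GRU(emb_dim, hid_dim, batch_first=True)  # next sentence
        self.out = nn.Linear(hid_dim, vocab_size)  # word logits

    def forward(self, cur, prev, nxt):
        # Encode the current sentence s_i into a single vector h_i.
        _, h = self.encoder(self.embed(cur))            # (1, batch, hid_dim)
        # Decode both neighboring sentences conditioned on h_i.
        out_prev, _ = self.dec_prev(self.embed(prev), h)
        out_next, _ = self.dec_next(self.embed(nxt), h)
        # h.squeeze(0) is the skip-thought vector used downstream.
        return self.out(out_prev), self.out(out_next), h.squeeze(0)

# Training would minimize cross-entropy between the returned logits
# and the (shifted) word indices of the previous and next sentences.
```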
Vocabulary expansion
The paper also introduced a simple vocabulary expansion method to encode words that were not seen during training, allowing us to expand the vocabulary to a million words.
Suppose we have a model that was trained to induce word representations, and assume that the vocabulary of this word-embedding space is much larger than that of the skip-thoughts model we trained. We then train a linear mapping from the word-embedding space to the skip-thoughts word space. Any word from the word-embedding space can now be mapped into the skip-thoughts space for encoding sentences.
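A minimal numpy sketch of this mapping, assuming two hypothetical embedding matrices whose rows correspond to words shared by both vocabularies (the paper fits this mapping with un-regularized linear regression; word2vec's 300 dimensions and the 620-dimensional skip-thought word space match the paper's setup, the data here is random):

```python
import numpy as np

# Hypothetical inputs: rows are embeddings for words present in BOTH
# vocabularies (word2vec on the left, skip-thought RNN word embeddings
# on the right).
rng = np.random.default_rng(0)
X_w2v = rng.normal(size=(5000, 300))   # word2vec space, d = 300
X_st  = rng.normal(size=(5000, 620))   # skip-thought word space, d = 620

# Fit the linear map W (no regularization) so that X_w2v @ W ~= X_st.
W, *_ = np.linalg.lstsq(X_w2v, X_st, rcond=None)

# Any word2vec vector -- including one for a word never seen during
# skip-thought training -- can now be projected into the encoder's
# input space:
unseen = rng.normal(size=300)          # stand-in for an OOV word's vector
projected = unseen @ W                 # usable as encoder input
```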
An alternative strategy is to avoid words altogether and train at the character level.
Experimental Setup
In the experiments, the authors evaluated the capability of the encoder as a generic feature extractor after training on the BookCorpus dataset. The experimental setup for each task is as follows:
1. Using the learned encoder as a feature extractor, extract skip-thought vectors for all sentences.
2. If the task involves computing scores between pairs of sentences, compute component-wise features between the pair; for relatedness-style tasks these are the element-wise product and absolute difference of the two vectors (see the sketch after this list).
3. Train a linear classifier on top of the extracted features, with no additional fine-tuning or backpropagation through the skip-thoughts model.
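A sketch of this protocol with scikit-learn. The `encode` function is a hypothetical stand-in for the frozen trained encoder (random vectors here, 4800-dimensional as in the paper's combine-skip model), and the sentences and labels are toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def encode(sentences):
    """Hypothetical stand-in for the frozen skip-thoughts encoder."""
    return rng.normal(size=(len(sentences), 4800))

# Step 1: extract skip-thought vectors for all sentences.
a = encode(["A man is playing a guitar.", "A dog runs in the park."])
b = encode(["Someone plays an instrument.", "The stock market fell."])

# Step 2: component-wise pair features -- element-wise product
# and absolute difference.
features = np.hstack([a * b, np.abs(a - b)])
labels = np.array([1, 0])   # toy labels: 1 = related pair, 0 = unrelated

# Step 3: a linear classifier on the frozen features; no gradients
# ever flow back into the encoder.
clf = LogisticRegression(max_iter=1000).fit(features, labels)
```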
Conclusion
The authors evaluated the effectiveness of skip-thought vectors as an off-the-shelf sentence representation with linear classifiers across eight tasks. Many of the methods they compared against were evaluated on only one task. The fact that skip-thought vectors perform well on all tasks considered highlights the robustness of the representations.
Many variations have yet to be explored, including:
- deep encoders and decoders,
- larger context windows,
- encoding and decoding paragraphs,
- other encoders, such as convnets.
It is likely that further exploration of this space will result in even higher-quality representations.