Sebastian Ruder
6 Mar, 2018
Research
Research directions at AYLIEN in NLP and transfer learning
12 mins read
In a recent blog post I outlined some interesting research directions for people who are just getting into NLP and ML (you can read the original post here). These ideas can also serve as a starting point and source of inspiration if you are considering starting a PhD in this field, which is why I’m highlighting them again here. At AYLIEN, we offer the opportunity for people to join our research team and complete a PhD, while also working on the application of that research directly as part of our products. If this sounds interesting, you should consider applying to work with us on these (or related) ideas. See this blog post for more information.
Table of contents:
Task-independent data augmentation for NLP
Few-shot learning for NLP
Transfer learning for NLP
Multi-task learning
Cross-lingual learning
Task-independent architecture improvements
It can be hard to find compelling topics to work on and know what questions are interesting to ask when you are just starting as a researcher in a new field. Machine learning research in particular moves so fast these days that it is difficult to find an opening.
This post aims to provide inspiration and ideas for research directions to junior researchers and those trying to get into research. It gathers a collection of research topics that are interesting to me, with a focus on NLP and transfer learning. As such, they might obviously not be of interest to everyone. If you are interested in Reinforcement Learning, OpenAI provides a selection of interesting RL-focused research topics. In case you’d like to collaborate with others or are interested in a broader range of topics, have a look at the Artificial Intelligence Open Network.
Most of these topics are not thoroughly thought out yet; in many cases, the general description is quite vague and subjective and many directions are possible. In addition, most of these are not low-hanging fruit, so serious effort is necessary to come up with a solution. I am happy to provide feedback with regard to any of these, but will not have time to provide more detailed guidance unless you have a working proof-of-concept. I will update this post periodically with new research directions and advances in already listed ones. Note that this collection does not attempt to review the extensive literature but only aims to give a glimpse of a topic; consequently, the references won’t be comprehensive.
I hope that this collection will peak your interest and serve as inspiration for your own research agenda.
Task-independent data augmentation for NLP
Data augmentation aims to create additional training data by producing variations of existing training examples through transformations, which can mirror those encountered in the real world. In Computer Vision (CV), common augmentation techniques are mirroring, random cropping, shearing, etc. Data augmentation is super useful in CV. For instance, it has been used to great effect in AlexNet (Krizhevsky et al., 2012) [1] to combat overfitting and in most state-of-the-art models since. In addition, data augmentation makes intuitive sense as it makes the training data more diverse and should thus increase a model’s generalization ability.
However, in NLP, data augmentation is not widely used. In my mind, this is for two reasons:
Data in NLP is discrete. This prevents us from applying simple transformations directly to the input data. Most recently proposed augmentation methods in CV focus on such transformations, e.g. domain randomization (Tobin et al., 2017) [2].
Small perturbations may change the meaning. Deleting a negation may change a sentence’s sentiment, while modifying a word in a paragraph might inadvertently change the answer to a question about that paragraph. This is not the case in CV where perturbing individual pixels does not change whether an image is a cat or dog and even stark changes such as interpolation of different images can be useful (Zhang et al., 2017) [3].
Existing approaches that I am aware of are either rule-based (Li et al., 2017) [5] or task-specific, e.g. for parsing (Wang and Eisner, 2016) [6] or zero-pronoun resolution (Liu et al., 2017) [7]. Xie et al. (2017) [39] replace words with samples from different distributions for language modelling and Machine Translation. Recent work focuses on creating adversarial examples either by replacing words or characters (Samanta and Mehta, 2017; Ebrahimi et al., 2017) [8, 9], concatenation (Jia and Liang, 2017) [11], or adding adversarial perturbations (Yasunaga et al., 2017) [10]. An adversarial setup is also used by Li et al. (2017) [16] who train a system to produce sequences that are indistinguishable from human-generated dialogue utterances.
Back-translation (Sennrich et al., 2015; Sennrich et al., 2016) [12, 13] is a common data augmentation method in Machine Translation (MT) that allows us to incorporate monolingual training data. For instance, when training a EN→FR system, monolingual French text is translated to English using an FR→EN system; the synthetic parallel data can then be used for training. Back-translation can also be used for paraphrasing (Mallinson et al., 2017) [14]. Paraphrasing has been used for data augmentation for QA (Dong et al., 2017) [15], but I am not aware of its use for other tasks.
Another method that is close to paraphrasing is generating sentences from a continuous space using a variational autoencoder (Bowman et al., 2016; Guu et al., 2017) [17, 19]. If the representations are disentangled as in (Hu et al., 2017) [18], then we are also not too far from style transfer (Shen et al., 2017) [20].
There are a few research directions that would be interesting to pursue:
Evaluation study: Evaluate a range of existing data augmentation methods as well as techniques that have not been widely used for augmentation such as paraphrasing and style transfer on a diverse range of tasks including text classification and sequence labelling. Identify what types of data augmentation are robust across task and which are task-specific. This could be packaged as a software library to make future benchmarking easier (think CleverHans for NLP).
Data augmentation with style transfer: Investigate if style transfer can be used to modify various attributes of training examples for more robust learning.
Learn the augmentation: Similar to Dong et al. (2017) we could learn either to paraphrase or to generate transformations for a particular task.
Learn a word embedding space for data augmentation: A typical word embedding space clusters synonyms and antonyms together; using nearest neighbours in this space for replacement is thus infeasible. Inspired by recent work (Mrkšić et al., 2017) [21], we could specialize the word embedding space to make it more suitable for data augmentation.
Adversarial data augmentation: Related to recent work in interpretability (Ribeiro et al., 2016) [22], we could change the most salient words in an example, i.e. those that a model depends on for a prediction. This still requires a semantics-preserving replacement method, however.
Few-shot learning for NLP
Zero-shot, one-shot and few-shot learning are one of the most interesting recent research directions IMO. Following the key insight from Vinyals et al. (2016) [4] that a few-shot learning model should be explicitly trained to perform few-shot learning, we have seen several recent advances (Ravi and Larochelle, 2017; Snell et al., 2017) [23, 24].
Learning from few labeled samples is one of the hardest problems IMO and one of the core capabilities that separates the current generation of ML models from more generally applicable systems. Zero-shot learning has only been investigated in the context of learning word embeddings for unknown words AFAIK. Dataless classification (Song and Roth, 2014; Song et al., 2016) [25, 26] is an interesting related direction that embeds labels and documents in a joint space, but requires interpretable labels with good descriptions.
Potential research directions are the following:
Standardized benchmarks: Create standardized benchmarks for few-shot learning for NLP. Vinyals et al. (2016) introduce a one-shot language modelling task for the Penn Treebank. The task, while useful, is dwarfed by the extensive evaluation on CV benchmarks and has not seen much use AFAIK. A few-shot learning benchmark for NLP should contain a large number of classes and provide a standardized split for reproducibility. Good candidate tasks would be topic classification or fine-grained entity recognition.
Evaluation study: After creating such a benchmark, the next step would be to evaluate how well existing few-shot learning models from CV perform for NLP.
Novel methods for NLP: Given a dataset for benchmarking and an empirical evaluation study, we could then start developing novel methods that can perform few-shot learning for NLP.
Transfer learning for NLP
Transfer learning has had a large impact on computer vision (CV) and has greatly lowered the entry threshold for people wanting to apply CV algorithms to their own problems. CV practicioners are no longer required to perform extensive feature-engineering for every new task, but can simply fine-tune a model pretrained on a large dataset with a small number of examples.
In NLP, however, we have so far only been pretraining the first layer of our models via pretrained embeddings. Recent approaches (Peters et al., 2017, 2018) [31, 32] add pretrained language model embedddings, but these still require custom architectures for every task. In my opinion, in order to unlock the true potential of transfer learning for NLP, we need to pretrain the entire model and fine-tune it on the target task, akin to fine-tuning ImageNet models. Language modelling, for instance, is a great task for pretraining and could be to NLP what ImageNet classification is to CV (Howard and Ruder, 2018) [33].
Here are some potential research directions in this context:
Identify useful pretraining tasks: The choice of the pretraining task is very important as even fine-tuning a model on a related task might only provide limited success (Mou et al., 2016) [38]. Other tasks such as those explored in recent work on learning general-purpose sentence embeddings (Conneau et al., 2017; Subramanian et al., 2018; Nie et al., 2017) [34, 35, 40] might be complementary to language model pretraining or suitable for other target tasks.
Fine-tuning of complex architectures: Pretraining is most useful when a model can be applied to many target tasks. However, it is still unclear how to pretrain more complex architectures, such as those used for pairwise classification tasks (Augenstein et al., 2018) or reasoning tasks such as QA or reading comprehension.
Multi-task learning
Multi-task learning (MTL) has become more commonly used in NLP. See herefor a general overview of multi-task learning and here for MTL objectives for NLP. However, there is still much we don’t understand about multi-task learning in general.
The main questions regarding MTL give rise to many interesting research directions:
Identify effective auxiliary tasks: One of the main questions is which tasks are useful for multi-task learning. Label entropy has been shown to be a predictor of MTL success (Alonso and Plank, 2017) [28], but this does not tell the whole story. In recent work (Augenstein et al., 2018) [27], we have found that auxiliary tasks with more data and more fine-grained labels are more useful. It would be useful if future MTL papers would not only propose a new model or auxiliary task, but also try to understand why a certain auxiliary task might be better than another closely related one.
Alternatives to hard parameter sharing: Hard parameter sharing is still the default modus operandi for MTL, but places a strong constraint on the model to compress knowledge pertaining to different tasks with the same parameters, which often makes learning difficult. We need better ways of doing MTL that are easy to use and work reliably across many tasks. Recently proposed methods such as cross-stitch units (Misra et al., 2017; Ruder et al., 2017) [29, 30] and a label embedding layer (Augenstein et al., 2018) are promising steps in this direction.
Artificial auxiliary tasks: The best auxiliary tasks are those, which are tailored to the target task and do not require any additional data. I have outlined a list of potential artificial auxiliary tasks here. However, it is not clear which of these work reliably across a number of diverse tasks or what variations or task-specific modifications are useful.
Cross-lingual learning
Creating models that perform well across languages and that can transfer knowledge from resource-rich to resource-poor languages is one of the most important research directions IMO. There has been much progress in learning cross-lingual representations that project different languages into a shared embedding space. Refer to Ruder et al. (2017) [36] for a survey.
Cross-lingual representations are commonly evaluated either intrinsically on similarity benchmarks or extrinsically on downstream tasks, such as text classification. While recent methods have advanced the state-of-the-art for many of these settings, we do not have a good understanding of the tasks or languages for which these methods fail and how to mitigate these failures in a task-independent manner, e.g. by injecting task-specific constraints (Mrkšić et al., 2017).
Task-independent architecture improvements
Novel architectures that outperform the current state-of-the-art and are tailored to specific tasks are regularly introduced, superseding the previous architecture. I have outlined best practices for different NLP tasks before, but without comparing such architectures on different tasks, it is often hard to gain insights from specialized architectures and tell which components would also be useful in other settings.
A particularly promising recent model is the Transformer (Vaswani et al., 2017) [37]. While the complete model might not be appropriate for every task, components such as multi-head attention or position-based encoding could be building blocks that are generally useful for many NLP tasks.
Conclusion
I hope you’ve found this collection of research directions useful. If you have suggestions on how to tackle some of these problems or ideas for related research topics, feel free to comment below.
References
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv Preprint arXiv:1703.06907. Retrieved from http://arxiv.org/abs/1703.06907
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond Empirical Risk Minimization, 1–11. Retrieved from http://arxiv.org/abs/1710.09412
Vinyals, O., Blundell, C., Lillicrap, T., Kavukcuoglu, K., & Wierstra, D. (2016). Matching Networks for One Shot Learning. NIPS 2016. Retrieved from http://arxiv.org/abs/1606.04080
Li, Y., Cohn, T., & Baldwin, T. (2017). Robust Training under Linguistic Adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Vol. 2, pp. 21–27).
Wang, D., & Eisner, J. (2016). The Galactic Dependencies Treebanks: Getting More Data by Synthesizing New Languages. Tacl, 4, 491–505. Retrieved from https://www.transacl.org/ojs/index.php/tacl/article/viewFile/917/212%0Ahttps://transacl.org/ojs/index.php/tacl/article/view/917
Liu, T., Cui, Y., Yin, Q., Zhang, W., Wang, S., & Hu, G. (2017). Generating and Exploiting Large-scale Pseudo Training Data for Zero Pronoun Resolution. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp. 102–111).
Samanta, S., & Mehta, S. (2017). Towards Crafting Text Adversarial Samples. arXiv preprint arXiv:1707.02812.
Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2017). HotFlip: White-Box Adversarial Examples for NLP. Retrieved from http://arxiv.org/abs/1712.06751
Yasunaga, M., Kasai, J., & Radev, D. (2017). Robust Multilingual Part-of-Speech Tagging via Adversarial Training. In Proceedings of NAACL 2018. Retrieved from http://arxiv.org/abs/1711.04903
Jia, R., & Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Sennrich, R., Haddow, B., & Birch, A. (2015). Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709.
Sennrich, R., Haddow, B., & Birch, A. (2016). Edinburgh neural machine translation systems for wmt 16. arXiv preprint arXiv:1606.02891.
Mallinson, J., Sennrich, R., & Lapata, M. (2017). Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (Vol. 1, pp. 881-893).
Dong, L., Mallinson, J., Reddy, S., & Lapata, M. (2017). Learning to Paraphrase for Question Answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Li, J., Monroe, W., Shi, T., Ritter, A., & Jurafsky, D. (2017). Adversarial Learning for Neural Dialogue Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Retrieved from http://arxiv.org/abs/1701.06547
Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., & Bengio, S. (2016). Generating Sentences from a Continuous Space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL). Retrieved from http://arxiv.org/abs/1511.06349
Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., & Xing, E. P. (2017). Toward Controlled Generation of Text. In Proceedings of the 34th International Conference on Machine Learning.
Guu, K., Hashimoto, T. B., Oren, Y., & Liang, P. (2017). Generating Sentences by Editing Prototypes.
Shen, T., Lei, T., Barzilay, R., & Jaakkola, T. (2017). Style Transfer from Non-Parallel Text by Cross-Alignment. In Advances in Neural Information Processing Systems. Retrieved from http://arxiv.org/abs/1705.09655
Mrkšić, N., Vulić, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gašić, M., … Young, S. (2017). Semantic Specialisation of Distributional Word Vector Spaces using Monolingual and Cross-Lingual Constraints. TACL. Retrieved from http://arxiv.org/abs/1706.00374
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016, August). Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1135-1144). ACM.
Ravi, S., & Larochelle, H. (2017). Optimization as a Model for Few-Shot Learning. In ICLR 2017.
Snell, J., Swersky, K., & Zemel, R. S. (2017). Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems.
Song, Y., & Roth, D. (2014). On dataless hierarchical text classification. Proceedings of AAAI, 1579–1585. Retrieved from http://cogcomp.cs.illinois.edu/papers/SongSoRo14.pdf
Song, Y., Upadhyay, S., Peng, H., & Roth, D. (2016). Cross-Lingual Dataless Classification for Many Languages. Ijcai, 2901–2907.
Augenstein, I., Ruder, S., & Søgaard, A. (2018). Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces. In Proceedings of NAACL 2018.
Alonso, H. M., & Plank, B. (2017). When is multitask learning effective? Multitask learning for semantic sequence prediction under varying data conditions. In EACL. Retrieved from http://arxiv.org/abs/1612.02251
Misra, I., Shrivastava, A., Gupta, A., & Hebert, M. (2016). Cross-stitch Networks for Multi-task Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. http://doi.org/10.1109/CVPR.2016.433
Ruder, S., Bingel, J., Augenstein, I., & Søgaard, A. (2017). Sluice networks: Learning what to share between loosely related tasks. arXiv preprint arXiv:1705.08142.
Peters, M. E., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017).
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL.
Howard, J., & Ruder, S. (2018). Fine-tuned Language Models for Text Classification. arXiv preprint arXiv:1801.06146.
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Subramanian, S., Trischler, A., Bengio, Y., & Pal, C. J. (2018). Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning. In Proceedings of ICLR 2018.
Ruder, S., Vulić, I., & Søgaard, A. (2017). A Survey of Cross-lingual Word Embedding Models. arXiv Preprint arXiv:1706.04902. Retrieved from http://arxiv.org/abs/1706.04902
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.
Mou, L., Meng, Z., Yan, R., Li, G., Xu, Y., Zhang, L., & Jin, Z. (2016). How Transferable are Neural Networks in NLP Applications? Proceedings of 2016 Conference on Empirical Methods in Natural Language Processing.
Xie, Z., Wang, S. I., Li, J., Levy, D., Nie, A., Jurafsky, D., & Ng, A. Y. (2017). Data Noising as Smoothing in Neural Network Language Models. In Proceedings of ICLR 2017.
Nie, A., Bennett, E. D., & Goodman, N. D. (2017). DisSent: Sentence Representation Learning from Explicit Discourse Relations. arXiv Preprint arXiv:1710.04334. Retrieved from http://arxiv.org/abs/1710.04334