Next Word Prediction: A Complete Guide

As part of my summer internship with Linagora’s R&D team, I was tasked with developing a next word prediction and autocomplete system akin to that of Google’s Smart Compose. In this article, we look at everything necessary to get started with next word prediction.

Email remains one of the largest forms of communication, both professionally and personally. It is estimated that almost 3.8 billion users send nearly 300 billion emails a day! Thus, for Linagora's open source collaborative platform OpenPaaS, which includes an email system, improving user experience is a priority. With real-time assisted writing, users can craft richer messages from scratch and enjoy a smoother flow than with the static alternative of fully automated replies.

A real-time assisted writing system

The general pipeline of an assisted writing system relies on an accurate and fast next word prediction model. Several problems are crucial to consider when building an industrial language model that enhances user experience: inference time, model compression, and transfer learning to further personalise suggestions.

At the moment, these issues are being addressed by the growing use of Deep Learning techniques for mobile keyboard predictions. Leading companies have already switched from statistical n-gram models to machine learning based systems deployed on mobile devices. However, Deep Learning brings the baggage of extremely large model sizes, unfit for on-device prediction. As a result, model compression is key to maintaining accuracy without using too much space. Furthermore, mobile keyboards are also tasked with learning your personal texting habits in order to make predictions suited to your style rather than to a general language model. Let's take a look at how it's done!

[Image: Next word predictions in Google's Gboard]

Some useful training corpora

In order to evaluate the different model architectures I will use to build my language model, it is crucial to have a solid benchmark evaluation set. For this task, I chose the Penn Tree Bank dataset, which makes it easy to test whether your model is overfitting. Due to the small vocabulary size (roughly 10,000 unique words!), it is imperative to build a model that is well regularised. In addition to this, Enron's email corpus was used to train on real email data and test predictions in the context of emails (with a much larger and richer vocabulary of nearly 39,000 unique words).

My task was to build a bilingual model, in French and English, with the intention of generalising to other languages as well. For this, I also considered several widely available French texts. FrWac, a web archive of the whole .fr domain, is a great corpus to train on a diverse set of French sequences. For those with extensive GPU resources, the entire French Wikipedia dump is also available online. With my code, I trained on a couple of short stories from Project Gutenberg (another great resource for textual data in multiple languages!).

Model Architectures

Now comes the fun part: language modelling. Generally speaking, approaches for text generation/prediction can be split into two categories: statistical based and learning based. In this article, we focus on the latter and take a deep dive into several recurrent neural network (RNN) variant architectures. For a next word prediction task, we want to build a word-level language model rather than a character n-gram based approach. However, if we also want to complete the current word while predicting the next one, we would need to incorporate beam search, which relies on a character-level approach. For this article, we will stick to next word prediction without beam search.

At the time of writing this article, the models with the most success in language generation and prediction tasks are transformer based, exploiting the idea of self-attention: every position in the input sequence learns to weigh every other position when building its representation, which captures long-range context without recurrence. However, these models are notoriously difficult to train without large amounts of training data as well as a good amount of GPU/TPU resources. Here we focus on the next best alternative: LSTM models.

The LSTM (Long Short-Term Memory) model was first introduced in the late 90s by Hochreiter and Schmidhuber. Since then, many advancements have been made using LSTM models, with applications ranging from time series analysis to connected handwriting recognition. An LSTM network is a type of RNN that learns dependencies on historic data for a sequence prediction task. What allows LSTMs to learn these historic dependencies are their feedback connections. A common LSTM cell has an input gate, an output gate, and a forget gate. The weights of these gates control the flow of information through the LSTM and are the parameters learnt during the training process.
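
To make the role of these gates concrete, here is a minimal sketch of a single LSTM cell step in PyTorch. This is purely illustrative (the class and variable names are my own); in practice you would simply use torch.nn.LSTM:

```python
import torch
import torch.nn as nn

# Minimal sketch of one LSTM cell step, spelling out the gates described above.
class NaiveLSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear map produces all four gate pre-activations at once;
        # these weights are the parameters learnt during training.
        self.gates = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.hidden_size = hidden_size

    def forward(self, x, state):
        h_prev, c_prev = state
        z = self.gates(torch.cat([x, h_prev], dim=1))
        i, f, o, g = z.chunk(4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # input/forget/output gates
        g = torch.tanh(g)                  # candidate cell state
        c = f * c_prev + i * g             # forget old memory, write new memory
        h = o * torch.tanh(c)              # expose a filtered view of the cell state
        return h, c
```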

Some variants of the LSTM model include the Convolutional LSTM (or CNN-LSTM) and the Bidirectional LSTM (or Bi-LSTM). For our task, these variants correspond to different encodings of the input sequence. A CNN-LSTM model is typically used for image captioning tasks, where the CNN portion recognises features of the image and the LSTM generates a suitable caption based on those features. Experimenting with a CNN as the encoding layer yielded some interesting results; however, with the large number of parameters and even bigger model sizes, it was not fit for this task.

[Image: Top: CNN-LSTM, Bottom: Bi-LSTM (credit to owners)]

In “Recurrent Neural Network Regularization” by Zaremba et al., the authors discuss a highly regularised LSTM model with added dropout. This model is well suited to our task, meeting the requirements of both small model size and high performance. In this era of Deep Learning, it is crucial to consider the tradeoff between model size and performance: being resource efficient should be the goal, rather than improving accuracy by 0.01%. The model discussed in this paper balances the trade-off between complexity and accuracy quite well, as we will see in the next section.

Implementation of the Chosen Model for Next Word Prediction

Below is an implementation of this model using the Deep Learning library PyTorch. I had previously implemented it in TensorFlow as well, but the model could not converge as well there, so I would not particularly recommend TensorFlow for this task, although ideally there should be no difference!
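
The original code embed is not reproduced here, but a minimal sketch of such a regularised, word-level LSTM language model might look as follows. The class name, layer sizes, dropout value and optimiser settings are illustrative assumptions, not the exact configuration used in the experiments:

```python
import torch
import torch.nn as nn

class WordLSTM(nn.Module):
    """Regularised word-level LSTM language model in the spirit of Zaremba et al."""
    def __init__(self, vocab_size, embed_dim=650, hidden_dim=650,
                 num_layers=2, dropout=0.5):
        super().__init__()
        self.drop = nn.Dropout(dropout)
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers,
                            dropout=dropout, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, hidden=None):
        emb = self.drop(self.embedding(tokens))      # (batch, seq_len, embed_dim)
        output, hidden = self.lstm(emb, hidden)      # (batch, seq_len, hidden_dim)
        logits = self.decoder(self.drop(output))     # (batch, seq_len, vocab_size)
        return logits, hidden

# Training step sketch: inputs are words 1..n, targets are words 2..n+1.
model = WordLSTM(vocab_size=10000)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

def train_step(inputs, targets):
    optimizer.zero_grad()
    logits, _ = model(inputs)
    loss = criterion(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)  # gradient clipping helps LSTM training
    optimizer.step()
    return loss.item()
```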

Results

So how did our final model perform on the different text corpora? In order to evaluate performance, we considered both qualitative and quantitative metrics. Qualitatively, it is crucial to look at the sentence structure (in cases where there isn't an exact match) and the context of the predictions; here we should ideally see the benefit and power of LSTM-based approaches over statistical methods. Quantitatively, we considered word perplexity (the exponential of the loss), which measures how confidently our model predicts words, and ExactMatch@N, which shows at which rank (N=1 meaning the highest-probability prediction, N=2 the second highest, and so on) we predict the right word, if at all. Lastly, we measured precision and accuracy, which, as the names suggest, measure the accuracy of our model!
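
As a rough illustration, word perplexity and ExactMatch@N can be computed along these lines (function names are my own; this is not the project's actual evaluation code):

```python
import math
import torch

def perplexity(avg_cross_entropy_loss):
    # Word-level perplexity is simply the exponential of the average loss.
    return math.exp(avg_cross_entropy_loss)

def exact_match_at_n(logits, target, n):
    # logits: (vocab_size,) scores for the next word; target: true word id.
    # Returns True if the target appears among the top-n predictions.
    top_n = torch.topk(logits, k=n).indices
    return bool((top_n == target).any())
```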

Our regularised LSTM performed quite well on the PTB corpus and managed to converge without overfitting, which was the primary goal. Upon increasing the input vocabulary size by using a portion of the Enron corpus, we saw even better results in terms of word-level perplexity. Here are some quantitative results on both the PTB and Enron datasets:

[Image: Word-level perplexity for the small regularised model (PTB) and the medium regularised model (Enron)]

Our model achieved a final test perplexity of 117.03 on the PTB corpus with the small regularised model and 111.05 on the PTB corpus with the medium regularised model. As the PTB corpus was our benchmark measurement, we proceeded with the medium regularised model to train on the Enron corpus. There we achieved a final test perplexity of 55.12, corresponding to a test loss of 3.56 after training for 39 epochs. While the convergence problem was solved in these two cases, the same was not true for the French text data.

[Image: Medium LSTM for French text]

As you can see above, while the perplexity on the training set decreases, the perplexity on the validation set does not converge. This can be attributed to the training and validation sets having very different vocabularies, or to the added complexity of French data, which has more verb conjugations than English. However, with more training data it is quite likely that convergence would no longer be an issue.

[Image: Some predictions with the Enron corpus]

Looking at some qualitative results for the model trained on the Enron corpus, we see a respectable accuracy (ExactMatch@1) of nearly 44%. Considering a vocabulary of 39,000 words, a 44% accuracy is a great result! This shows that the regularised LSTM model works well for the next word prediction task, especially with smaller amounts of training data.

How about using pre-trained models?

Now that we have explored different model architectures, it is also worth discussing the use of pre-trained language models for this task. Currently, language models like OpenAI's GPT are trained on large amounts of Wikipedia data, making them highly nuanced and well suited to text prediction tasks. However, models like GPT are not easily specialised to different corpora, in our case emails. Furthermore, due to their extremely large number of parameters, these pre-trained language models are suited neither to on-device deployment nor to on-device training. One last problem with GPT specifically is that, at the moment, it is a pre-trained English-only model, which makes it unsuitable for our bilingual language modelling problem.

Another popular pre-trained language model is Google's BERT, which uses the transformer architecture discussed earlier. Unlike GPT, BERT is available in a multilingual format, and specifically for French, researchers from Inria, Facebook AI, and Sorbonne Université recently released CamemBERT (trained on 138 GB of French data), which is quite an exciting advancement! BERT itself was designed for masked language modelling, inspired by the Cloze task, and its full power is still not well understood. While BERT can be used for next word prediction by placing the mask as the last word, it is best suited to having context on both the left and the right of the mask, which makes full use of its bidirectional nature. For these pre-trained language models, transfer learning is quite difficult, which makes them general language models rather than models for a specific use case. Furthermore, without that specialisation, personalised language models (obtained by training on device) are even harder to achieve. This is not to say that these pre-trained models are not excellent for a first trial; you will surely find some interesting results with them!
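
For a quick first trial, a fill-mask query with such a pre-trained model might look like this, assuming the Hugging Face transformers library and the publicly available camembert-base checkpoint (a generic example, not part of the OpenPaaS system):

```python
from transformers import pipeline

# CamemBERT is a RoBERTa-style model, so its mask token is "<mask>".
fill_mask = pipeline("fill-mask", model="camembert-base")

# Putting the mask at the end approximates next word prediction,
# even though BERT-style models work best with context on both sides.
for prediction in fill_mask("Je vous remercie pour votre <mask>"):
    print(prediction["token_str"], round(prediction["score"], 3))
```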

Use on Inference

So now that you’ve chosen your desired language model, how do you get it to make the predictions?

[Image: Prediction on inference (taken from Trung Tran's tutorial in Machine Talk)]

The above diagram illustrates the inference process with an input sequence and output word(s). With our language model, for an input sequence of 6 words (let us label them 1, 2, 3, 4, 5, 6), the model outputs another set of 6 words (which should try to match words 2, 3, 4, 5, 6 and the next word), so we take the last word of the output sequence, the one with the highest probability, as our predicted next word.

Here is how the prediction function could look for this problem:

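The original code embed is not reproduced here; a minimal sketch, assuming the WordLSTM model from earlier and word_to_idx / idx_to_word mappings built during training, might look like this:

```python
import torch

def predict_next_word(model, text, word_to_idx, idx_to_word, device="cpu"):
    """Feed a text prefix through the model and return the most probable next word.

    Assumes `word_to_idx`/`idx_to_word` were built during training and that
    unknown words map to an "<unk>" entry.
    """
    model.eval()
    tokens = [word_to_idx.get(w, word_to_idx["<unk>"]) for w in text.lower().split()]
    inputs = torch.tensor([tokens], dtype=torch.long, device=device)  # (1, seq_len)
    with torch.no_grad():
        logits, _ = model(inputs)
    # The model outputs one prediction per input position; we only care
    # about the distribution after the last word of the sequence.
    last_logits = logits[0, -1]
    next_id = torch.argmax(last_logits).item()
    return idx_to_word[next_id]

# Example usage (assuming the model and vocabulary from earlier):
# print(predict_next_word(model, "please let me know if you have any", word_to_idx, idx_to_word))
```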

A Final Note on Model Compression

Okay, so by now our model is ready to be deployed and eager to start predicting your words! But do you really expect to deploy models larger than 200 MB on devices like mobile phones? How are we supposed to maintain model accuracy while using only an insignificant amount of space? In comes model compression!

Industrialising NLP is a challenging task and there have been many recent developments in this field. One key breakthrough is knowledge distillation. The idea is to have two models: a teacher and a student. The teacher model is the complete LSTM model with all its connections and feedback loops. The student model is a simple one-layer feed-forward network that learns from the last softmax layer of the teacher model. This allows the student to imitate the teacher's predictions while being much shallower and therefore occupying significantly less space! This is a very simple explanation of model compression; if you are interested in more, I suggest the following video: https://www.youtube.com/watch?v=b3zf-JylUus . The picture below illustrates the concept quite well:
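
A minimal sketch of the distillation objective might look as follows; the temperature and loss weighting are typical illustrative choices, not values taken from the project:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Combine soft targets from the teacher with the usual hard-label loss."""
    # Soften both distributions with temperature T so the student can learn
    # from the teacher's full output distribution, not just its top choice.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard_loss = F.cross_entropy(student_logits, targets)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```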

[Image: Knowledge distillation visualised]

Conclusion — What next?

We're finally at the end… or just the beginning? There are still some exciting fronts for me to discover, from dealing with out-of-vocabulary and rare words to contextual encoding. Here we explored only the surface of a next word prediction model; the next steps would be to work out model personalisation while still maintaining user privacy, to handle contractions as well as punctuation marks, and to add a beam search method for autocomplete to make the model even more interactive.

I hope that after reading this article, you look at the little words suggested by your phone keyboard a bit differently. The work that goes on behind the scenes to make your user experience smooth and seamless is enormous and exciting! With current advancements in language understanding and generation, we are at a very rich time in technology, and I personally can't wait to see what is coming :)

Thank you to Zied Sellami for his supervision and support, and a huge thanks to the entire Linagora team for their warm welcome even while working remotely.

Original article: https://medium.com/linagoralabs/next-word-prediction-a-complete-guide-d2e69a7a09e6
