[Intelligent Robots 01: Facebook Blender] ParlAI: Recipes for building an open-domain chatbot

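ParlAI is an open-source framework from Facebook for training and evaluating AI models on a wide range of publicly available dialogue datasets. It is a unified platform for sharing, training, and evaluating dialogue models, and it supports many kinds of dialogue tasks.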

The corresponding paper is: Recipes for building an open-domain chatbot



A brief summary of the paper follows:

Abstract

Building open-domain chatbots is a challenging area for machine learning research.

Building open-domain chatbots is a very challenging direction in machine learning research.

While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot.

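Earlier work has shown that increasing a neural model's parameter count and the size of its training data yields better results; for a high-performing chatbot, however, we show that other factors matter as well.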

Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona.

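A good conversation calls for a set of skills that an expert conversationalist blends seamlessly: offering engaging talking points, listening to the conversation partner, displaying knowledge, empathy, and personality where appropriate, and keeping a consistent persona throughout.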

We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy.

We show that large models can learn these skills when given suitable training data and a suitable choice of generation strategy.

We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements.

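We build variants of our recipes with models of 90M, 2.7B, and 9.4B parameters, and we release the models and code publicly. Human evaluation shows that, in multi-turn dialogue, our best models outperform existing approaches on measures of engagingness and humanness.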

We then discuss the limitations of this work by analyzing failure cases of our models.

Finally, we discuss the limitations of this work by analyzing cases where our models fail.


In this work, we provide recipes for building open domain chatbots that perform well in human evaluations.

This work offers recipes for building open-domain chatbots that do well when judged by humans.

It has been shown across the field of NLP (Devlin et al., 2019) and in conversational agents in particular (Dinan et al., 2020; Zhang et al., 2019; Adiwardana et al., 2020) that pre-training on large corpora is important. Beyond simply scaling models the two main takeaways from our study are:

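Work across NLP (Devlin et al., 2019) and on conversational agents in particular (Dinan et al., 2020; Zhang et al., 2019; Adiwardana et al., 2020) has shown that pre-training on large corpora is important. Beyond simply scaling up model size, our study has two main takeaways: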

    1. Blending Skills

        Large improvements can be made by finetuning on data that emphasizes desirable conversational skills.

        Fine-tuning on data that emphasizes desirable conversational skills brings large improvements.

        We select tasks that make the model focus on personality and engagingness, knowledge, and empathy, achieving large gains by using the recently introduced Blended Skill Talk (BST) set-up (Smith et al., 2020), which targets those aspects by providing training data and initial conversational context (personas and topics).

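        We select tasks that focus the model on personality and engagingness, knowledge, and empathy. Using the recently introduced Blended Skill Talk (BST) set-up (Smith et al., 2020) yields large gains; BST targets these aspects by supplying training data and initial conversational context (personas and topics).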

        Small models using BST can match or outperform larger models that do not. While BST emphasizes desirable traits, we also show this tuning can minimize undesirable traits learnt from large corpora, such as toxicity.

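        Small models that use BST can match or outperform larger models that do not. Although BST emphasizes desirable traits, we also show that this tuning can minimize undesirable traits learned from large corpora, such as toxicity.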

    2. Generation Strategies

        The choice of decoding algorithm is of critical importance, and two models with the same perplexity but different decoding algorithms can give vastly different results.

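        The choice of decoding algorithm is critical: two models with the same perplexity but different decoding algorithms can produce very different results.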

        In particular we show that the length of the bot’s utterances are crucial to human judgments of quality – too short and the responses are seen as dull or showing a lack of interest, too long and the bot appears to waffle and not listen.

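        In particular, we show that the length of the bot's utterances is crucial to how humans judge quality: responses that are too short come across as dull or uninterested, while responses that are too long make the bot seem to waffle and not listen.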

        We show, contrary to previous work which reports that beam search is inferior to sampling (Holtzman et al., 2019; Adiwardana et al., 2020), that careful choice of search hyperparameters can give strong results by controlling trade-offs.

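        Previous work reports that beam search is inferior to sampling (Holtzman et al., 2019; Adiwardana et al., 2020). In contrast, we find that carefully chosen search hyperparameters give strong results by controlling the trade-offs involved.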

        In particular, constraining the minimum beam length gives a crucial control of the dull versus spicy spectrum of responses.

        In particular, constraining the minimum beam length is a key control over where responses fall on the spectrum from dull to spicy.
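
To make the minimum-length constraint concrete, here is a minimal Python sketch of beam search in which a hypothesis may not emit the end-of-sequence token until it holds at least `min_length` tokens. It is not the paper's or ParlAI's implementation: the function `beam_search`, the `toy_model` probability table, and all parameter names are assumptions invented for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"
Hypothesis = Tuple[float, Tuple[str, ...]]  # (summed log-probability, tokens)


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_size: int = 3,
    min_length: int = 0,
    max_length: int = 10,
) -> Tuple[str, ...]:
    """Return the best finished token sequence (EOS not included)."""
    beams: List[Hypothesis] = [(0.0, ())]
    finished: List[Hypothesis] = []
    for _ in range(max_length + 1):
        candidates: List[Hypothesis] = []
        for score, prefix in beams:
            for token, logp in next_log_probs(prefix).items():
                if token == EOS:
                    # Minimum-length constraint: ending is only allowed
                    # once the hypothesis has min_length tokens.
                    if len(prefix) >= min_length:
                        finished.append((score + logp, prefix))
                elif len(prefix) < max_length:
                    candidates.append((score + logp, prefix + (token,)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    if not finished:
        return beams[0][1]  # nothing could end; fall back to the best prefix
    return max(finished, key=lambda c: c[0])[1]


# Hypothetical next-token probabilities: the model likes to stop early.
TABLE = {
    (): {"ok": 0.6, "i": 0.4},
    ("ok",): {EOS: 0.9, "sure": 0.1},
    ("i",): {"love": 0.7, EOS: 0.3},
    ("i", "love"): {"hiking": 0.8, EOS: 0.2},
}


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Look up the toy distribution; unknown prefixes can only end."""
    return {t: math.log(p) for t, p in TABLE.get(prefix, {EOS: 1.0}).items()}


print(beam_search(toy_model, min_length=0))  # ('ok',)
print(beam_search(toy_model, min_length=3))  # ('i', 'love', 'hiking')
```

With this toy table, the unconstrained search settles on the short, dull reply, while `min_length=3` pushes it onto the longer continuation. Scores here are plain summed log-probabilities with no length normalization, which is a simplification.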

Human evaluation results are highly dependent on the precise set-up one chooses. Model performance can be strongly affected by the specific instructions given to evaluators, such as a given topic or not, the overall conversation length, and the choice of human interlocutors, which may be difficult to jointly account for. We report performance when employing crowdworkers in short multi-turn conversations with no prompt.

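Human evaluation results depend heavily on the exact set-up chosen. Model performance can be strongly affected by the specific instructions given to evaluators (for example, whether a topic is given), by the overall conversation length, and by the choice of human interlocutors, and these factors can be hard to account for jointly. We report results obtained with crowdworkers in short multi-turn conversations with no prompt.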

However, in addition to that, we believe releasing models is the most reliable way to enable full insight into their capabilities. We thus make publicly available our large-scale, state of the art open-domain conversational agent, including code to fine-tune it, the model weights, and code to evaluate it, so that our setup is reproducible.

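Beyond that, we believe releasing models is the most reliable way to give full insight into what they can do. We therefore release our large-scale, state-of-the-art open-domain conversational agent, including the fine-tuning code, the model weights, and the evaluation code, so that our setup can be reproduced.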

In human evaluations of engagingness our best model outperforms Meena (Adiwardana et al., 2020) in a pairwise comparison 75% to 25%, and in terms of humanness by 65% to 35% (both statistically significant, two-tailed binomial test, p < 0.01).

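In human evaluations of engagingness, our best model beats Meena (Adiwardana et al., 2020) in pairwise comparison by 75% to 25%; on humanness it wins by 65% to 35% (both statistically significant; two-tailed binomial test, p < 0.01).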
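
As a rough illustration of the significance test named above, the following Python sketch computes an exact two-tailed binomial p-value under the null hypothesis that each model wins a pairwise comparison with probability 0.5. The number of comparisons is not stated in this summary, so `n = 100` below is purely a hypothetical value, and the function name is likewise an assumption.

```python
from math import comb


def two_tailed_binomial_p(wins: int, n: int) -> float:
    """Exact two-tailed p-value for H0: win probability = 0.5.

    Under H0 the distribution is symmetric, so the two-tailed value is
    twice the tail at least as extreme as the observed count (capped at 1).
    """
    k = max(wins, n - wins)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


n = 100  # HYPOTHETICAL number of pairwise comparisons, not from the paper
p_engagingness = two_tailed_binomial_p(75, n)  # 75% vs 25%
p_humanness = two_tailed_binomial_p(65, n)     # 65% vs 35%
print(p_engagingness < 0.01, p_humanness < 0.01)
```

For this hypothetical `n`, both splits fall below the 0.01 threshold; with far fewer comparisons, the same percentages would not necessarily be significant.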

While the performance of our bot at first sight is very good, we do not believe we are yet close to solving the problem of open-domain conversation. We thus discuss limitations of our models, and initial attempts to solve them. In particular, our models still display: a lack of in-depth knowledge if sufficiently interrogated; a tendency to stick to simpler language; and a tendency to repeat oft-used phrases.

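Although our bot looks very good at first sight, we do not think we are close to solving open-domain conversation. We therefore discuss the limitations of our models and our first attempts to address them. In particular, our models still show a lack of in-depth knowledge when questioned closely, a tendency to stick to simpler language, and a tendency to repeat frequently used phrases.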

We show how unlikelihood training and retrieve-and-refine mechanisms are potential avenues for fixing these problems; however, our initial experiments with these methods are inconclusive. We thus discuss future possibilities for alleviating these problems as well as methods to clearly expose and evaluate them.

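We show that unlikelihood training and retrieve-and-refine mechanisms are potential ways to fix these problems, although our initial experiments with them are inconclusive. We therefore discuss future possibilities for alleviating these problems, as well as methods to clearly expose and evaluate them.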
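
For readers unfamiliar with unlikelihood training, the sketch below shows the general idea of a per-step unlikelihood loss: the usual negative log-likelihood of the target token plus a penalty on probability mass assigned to unwanted "negative candidate" tokens (for example, over-used words). This summary does not describe how the paper chooses the negative candidates or weights the terms, so the function name, the `alpha` weight, and the toy distribution are all assumptions.

```python
import math
from typing import Dict, Iterable


def unlikelihood_step_loss(
    probs: Dict[str, float],
    target: str,
    negative_candidates: Iterable[str],
    alpha: float = 1.0,
) -> float:
    """Likelihood term plus unlikelihood penalty for one decoding step."""
    # Standard maximum-likelihood term: -log p(target).
    mle = -math.log(probs[target])
    # Unlikelihood term: -sum over negatives of log(1 - p(candidate)).
    # The target itself is never penalized; the clamp avoids log(0).
    penalty = -sum(
        math.log(max(1.0 - probs[c], 1e-12))
        for c in negative_candidates
        if c != target
    )
    return mle + alpha * penalty


# Hypothetical next-token distribution over a three-word vocabulary.
probs = {"i": 0.5, "like": 0.3, "the": 0.2}
plain = unlikelihood_step_loss(probs, "like", [])
penalized = unlikelihood_step_loss(probs, "like", ["i"])
print(penalized > plain)  # True: mass on the unwanted token adds loss
```

Minimizing the penalty term pushes probability away from the negative candidates, which is why the approach is a natural fit for discouraging repeated phrases.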

