Similarity Series 9: unify USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

Model characteristics: multiple model variants

The model is a Transformer trained to produce a response, r, conditioned on the dialog context, c, and a fact, f. The input to the Transformer is the concatenation of c and f.

Different decoding strategies are used to obtain four different outputs from this model.

  1. standard argmax sampling
  2. nucleus sampling (Holtzman et al., 2019) is used at three different rates: p = {0.3, 0.5, 0.7}
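The two decoding strategies above can be sketched in a few lines. This is a minimal illustration for a single decoding step, not the paper's implementation: the function names (`softmax`, `nucleus_filter`, `decode_step`) and the toy logits are hypothetical, and a real system would apply this at every step over the model's vocabulary.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p,
    and renormalize their probabilities so they sum to 1."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}

def decode_step(logits, p=None, rng=random):
    """p=None picks the argmax token; otherwise sample from the nucleus."""
    probs = softmax(logits)
    if p is None:
        return max(range(len(probs)), key=lambda i: probs[i])
    nucleus = nucleus_filter(probs, p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]

# The four decoding configurations listed above (None stands for argmax).
DECODING_CONFIGS = [None, 0.3, 0.5, 0.7]
```

A smaller p keeps fewer candidate tokens, so p = 0.3 behaves close to argmax while p = 0.7 allows more varied responses.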

Data construction

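Crowdsourcing was not used. The paper gives three reasons: (1) the annotation instructions are lengthy, (2) a preliminary annotation pass was carried out, followed by a group discussion, and (3) having many annotations from a small number of annotators makes it possible to examine annotator subjectivity.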

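**Annotation process:** Annotators were given a set of instructions (Appendix A). A small preliminary annotation pass was carried out, in which each person annotated 5 dialog contexts (30 responses in total). Inter-annotator agreement was computed for each question. After the preliminary pass and a discussion meeting, the instructions were refined (for example, Maintains Context was changed to a 3-point scale instead of a 2-point scale). Once the instructions had been revised, the full annotation was carried out.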
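As an illustration of the per-question agreement check, the sketch below averages pairwise Cohen's kappa over annotators for one question. These notes do not say which agreement statistic the authors used, so the choice of kappa is an assumption, and the function names and the `ratings` layout (annotator name mapped to a list of scores for the same responses) are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators rating the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(ratings):
    """Average kappa over all annotator pairs for one question.

    `ratings` maps an annotator name to that annotator's list of scores
    (assumed layout, same item order for every annotator).
    """
    pairs = list(combinations(ratings.values(), 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Running `mean_pairwise_kappa` once per question gives one agreement value per annotated quality, which is the kind of figure a discussion meeting could use to spot unclear instructions.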

The annotated qualities are:

• Understandable (0 - 1): Is the response understandable given the previous context?
• Natural (1 - 3): Does the response seem to be something that a person would naturally say?
• Maintains Context (1 - 3): Does the response serve as a valid continuation of the preceding conversation?
• Interesting (1 - 3): Is the response dull or interesting?
• Uses Knowledge (0 - 1): Given the fact that the response is conditioned on, how well does the response use that fact?
• Overall Quality (1 - 5): Given your answers above, what is your overall impression of the quality of this utterance?
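One way to hold a single annotation and enforce the rating scales listed above is a small validated record. This is only a sketch: the `Annotation` class, the `SCALES` table, and the field names are hypothetical and are not taken from the paper or its released data.

```python
from dataclasses import dataclass

# Inclusive (min, max) range for each annotated quality, as listed above.
SCALES = {
    "understandable": (0, 1),
    "natural": (1, 3),
    "maintains_context": (1, 3),
    "interesting": (1, 3),
    "uses_knowledge": (0, 1),
    "overall_quality": (1, 5),
}

@dataclass(frozen=True)
class Annotation:
    understandable: int
    natural: int
    maintains_context: int
    interesting: int
    uses_knowledge: int
    overall_quality: int

    def __post_init__(self):
        # Reject any score that falls outside its scale.
        for name, (low, high) in SCALES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
```

For example, `Annotation(1, 3, 2, 2, 0, 4)` is accepted, while a Natural score of 4 raises `ValueError`.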

Three models were used to generate system outputs: a sequence-to-sequence model (Seq2Seq), an LSTM language model (LM) and a Key-Value Profile Memory Network (KV-MemNN).
