Evaluation - Text Summarization - ROUGE

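Recently, while evaluating generated summaries, I needed to revisit the definition of ROUGE.
Along the way, I came across the criteria people use to judge a text summary: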

  1. Is the summary fluent?
  2. Is the summary adequate? For example, is its length appropriate, and does it cover all of the most important information in the source text?

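ROUGE is used here to evaluate adequacy. Concretely, it simply counts how many n-grams in the generated summary match n-grams in the reference summary (ground truth).
(Or in several reference summaries, since more than one reference may exist. When there are multiple reference summaries, the ROUGE-1 score is averaged over them.)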
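The counting described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming whitespace tokenization and recall as the reported score; the function names (`ngrams`, `rouge_n_recall`, `rouge_n_multi`) and the example sentences are made up for this post and do not come from any ROUGE package.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped matching: an n-gram is credited at most as many times
    # as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def rouge_n_multi(candidate, references, n=1):
    """Average the per-reference recall (the averaging mentioned above)."""
    scores = [rouge_n_recall(candidate, ref, n) for ref in references]
    return sum(scores) / len(scores)


candidate = "the cat sat on the mat"
references = ["the cat was on the mat", "a cat sat on the mat"]

print(rouge_n_recall(candidate, references[0], n=1))  # 5 of 6 unigrams match
print(rouge_n_recall(candidate, references[0], n=2))  # 3 of 5 bigrams match
print(rouge_n_multi(candidate, references, n=1))
```

Real toolkits usually add stemming, stopword handling, and precision/F-score variants on top of this, so scores from a production package may differ from this sketch.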

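Because ROUGE is based on content overlap, it can tell whether the generated summary and the reference summary discuss roughly the same concepts, but it cannot tell whether the two reach the same conclusion, or whether the generated summary is sensible.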

Wikipedia explains it as follows.

ROUGE-N: Overlap of N-grams between the system and reference summaries.

  • ROUGE-1 refers to the overlap of 1-gram (each word) between the system and reference summaries.
  • ROUGE-2 refers to the overlap of bigrams between the system and reference summaries.

ROUGE-L: Longest Common Subsequence (LCS) based statistics. Longest common subsequence problem takes into account sentence level structure similarity naturally and identifies longest co-occurring in sequence n-grams automatically.
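The definition above does not spell out how the LCS turns into a score. One common sentence-level formulation is given below; the symbols are introduced here for illustration: X is the reference with m words, Y is the generated summary with n words, LCS(X, Y) is the length of their longest common subsequence, and β weights recall against precision.

```latex
R_{lcs} = \frac{LCS(X, Y)}{m}, \qquad
P_{lcs} = \frac{LCS(X, Y)}{n}, \qquad
F_{lcs} = \frac{(1 + \beta^{2})\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^{2} P_{lcs}}
```

With a large β, F_lcs is dominated by R_lcs, which matches ROUGE's recall-oriented use.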

ROUGE-W: Weighted LCS-based statistics that favors consecutive LCSes.

ROUGE-S: Skip-bigram based co-occurrence statistics. Skip-bigram is any pair of words in their sentence order.
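To make "any pair of words in their sentence order" concrete, here is a small Python sketch. It assumes whitespace tokenization, no limit on the gap between the two words, and set-based matching (repeated pairs are counted once); `skip_bigrams` and `rouge_s_recall` are names invented for this example.

```python
from itertools import combinations


def skip_bigrams(tokens):
    """All word pairs (w_i, w_j) with i < j, with any gap allowed."""
    # combinations() keeps the input order, so each pair is in sentence order.
    return set(combinations(tokens, 2))


def rouge_s_recall(candidate, reference):
    """Share of the reference's skip-bigrams that the candidate also has."""
    cand = skip_bigrams(candidate.split())
    ref = skip_bigrams(reference.split())
    if not ref:
        return 0.0
    return len(cand & ref) / len(ref)


reference = "police killed the gunman"
candidate = "police kill the gunman"

print(sorted(skip_bigrams(reference.split())))  # 6 pairs for a 4-word sentence
print(rouge_s_recall(candidate, reference))     # 3 of the 6 pairs match
```

A sentence of k words has k(k-1)/2 skip-bigrams, which is why implementations often cap the allowed gap between the two words.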

ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.

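ROUGE is very similar to BLEU, but BLEU measures precision, while ROUGE measures recall.
In addition, the words that ROUGE matches need not be contiguous (as in ROUGE-L and ROUGE-S), whereas a BLEU n-gram requires its words to appear consecutively.
For example, the longest common subsequence of the two sentences “我喜欢吃香蕉” ("I like eating bananas") and “我刚才吃了一个香蕉” ("I just ate a banana") is “我 吃 香 蕉”.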
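The example above can be checked with a standard dynamic-programming LCS. The sketch below assumes character-level comparison, which is how the example splits the sentences; a real evaluation of Chinese text would more likely run a word segmenter first. The function name `lcs` is just for illustration.

```python
def lcs(a, b):
    """Return one longest common subsequence of sequences a and b as a list."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from the bottom-right corner to recover the matched items.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


s1 = "我喜欢吃香蕉"
s2 = "我刚才吃了一个香蕉"
print(" ".join(lcs(s1, s2)))  # 我 吃 香 蕉
```

Note that the matched characters are spread out in the second sentence, which is exactly the non-contiguous matching that an n-gram metric would miss.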
