[Text Generation Evaluation Metrics] A Summary of Objective Evaluation Metrics for Text Generation

This post is a reorganized and extended version of a blog post on objective evaluation metrics for text generation from betheme.net.
The code has been checked for usability, the links have been verified, and erroneous code has been corrected. Some of the code is implemented in the author's other blog posts, which are linked where relevant.

1. Word-Overlap-Based Metrics

Word-overlap-based metrics measure the similarity of lexical distributions. They mainly include BLEU, ROUGE, METEOR, NIST, and Distinct. BLEU and METEOR are commonly used for machine translation, while ROUGE is commonly used for automatic summarization.

(1) BLEU (Bilingual Evaluation Understudy)

BLEU is precision-oriented: it computes the proportion of n-grams in the generated text that also appear in the reference text. It is reliable at the corpus level but performs poorly at the sentence level. A minimal code sketch is given after the links below.

  • Related paper: Bleu: a Method for Automatic Evaluation of Machine Translation
  • Code example: [Text Generation Evaluation Metrics] BLEU: principle and code example
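
To make the n-gram precision idea concrete, here is a minimal sketch using NLTK's sentence_bleu (the example sentences are made up; for corpus-level scores use corpus_bleu or sacrebleu instead):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["this", "is", "a", "cat"]]   # list of tokenized reference sentences
candidate = ["this", "is", "my", "cat"]    # tokenized generated sentence

# smoothing avoids zero scores when some higher-order n-gram has no match
smooth = SmoothingFunction().method1
print(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=smooth))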

(2) ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

An improvement over BLEU that focuses on recall: it computes the proportion of n-grams in the reference text that appear in the generated text. It has several variants: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. A minimal code sketch is given after the links below.

  • Related paper: Rouge: A package for automatic evaluation of summaries
  • Code example: [Text Generation Evaluation Metrics] ROUGE: principle and code example
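
A minimal sketch using the rouge-score package (pip install rouge-score); the example strings are illustrative:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
# score(target, prediction): target is the reference, prediction is the generated text
scores = scorer.score("This is a cat", "This is my cat")
print(scores)  # each entry is a Score(precision, recall, fmeasure) tuple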

(3) METEOR

An improvement over BLEU that takes the alignment between the generated text and the reference into account. It uses WordNet to match not only exact token sequences but also synonyms, stems and affixes, and paraphrases. A minimal code sketch is given after the links below.

  • Related paper: METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
  • Code example: [Text Generation Evaluation Metrics] METEOR: principle and code example
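
A minimal sketch using NLTK's meteor_score; note that recent NLTK versions expect pre-tokenized input and need the WordNet data downloaded (the example sentences are made up):

import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet")  # METEOR uses WordNet for synonym and stem matching

references = [["this", "is", "a", "cat"]]   # one or more tokenized references
hypothesis = ["this", "is", "my", "cat"]
print(meteor_score(references, hypothesis))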

(4) NIST (National Institute of Standards and Technology)

An improvement over BLEU that weights each n-gram by its information content. A minimal code sketch is given after the link below.

  • Code example: Machine translation evaluation: the NIST algorithm, an improvement on BLEU
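
NLTK also ships a NIST implementation; a minimal sentence-level sketch (the example sentences are made up, and n is capped at 4 here because the sentences are short):

from nltk.translate.nist_score import sentence_nist

references = [["this", "is", "a", "cat"]]   # tokenized reference(s)
hypothesis = ["this", "is", "my", "cat"]
print(sentence_nist(references, hypothesis, n=4))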

(5) DISTINCT

Measures the diversity of generated text: the ratio of the number of distinct n-grams to the total number of n-grams in the generated text. A minimal code sketch is given after the links below.

  • Related paper: A diversity-promoting objective function for neural conversation models
  • Code example: [Text Generation Evaluation Metrics] DISTINCT: principle and code example
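
Distinct-n is simple enough to compute directly; a minimal sketch (distinct_n is a hypothetical helper written here, not part of any library):

from nltk.util import ngrams

def distinct_n(sentences, n):
    """Ratio of unique n-grams to total n-grams over a list of tokenized sentences."""
    all_ngrams = [g for sent in sentences for g in ngrams(sent, n)]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)

generated = [["i", "am", "fine"], ["i", "am", "fine", "too"]]
print(distinct_n(generated, 1), distinct_n(generated, 2))  # distinct-1, distinct-2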

2. Word-Vector-Based Metrics

Word-vector-based metrics measure the similarity of semantic distributions and mainly include Embedding Average Score, Greedy Matching Score, and Vector Extrema Score.

(1) Embedding Average Score

Averages the word vectors of the reference text and of the generated text to obtain one feature vector for each, then computes the cosine similarity between the two.

(2) Greedy Matching Score

Greedily matches each word in one text to the most similar word in the other text (by word-vector cosine similarity), averages these maximum similarities, and does so in both directions to obtain the final score.

(3) Vector Extrema Score

For both the reference text and the generated text, takes the most extreme value (the value with the largest magnitude) of each dimension across the word vectors in the sentence as that dimension of the sentence vector, then computes the cosine similarity between the two resulting sentence vectors.
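
To make the three definitions concrete, here is a small numpy sketch; it assumes you already have the word vectors of each sentence as a 2-D array (rows = words), obtained from any embedding model of your choice:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_average(ref_vecs, hyp_vecs):
    # mean word vector of each sentence, then cosine similarity
    return cosine(np.mean(ref_vecs, axis=0), np.mean(hyp_vecs, axis=0))

def greedy_matching(ref_vecs, hyp_vecs):
    # match each word to its most similar word on the other side, average both directions
    def one_way(a, b):
        return np.mean([max(cosine(x, y) for y in b) for x in a])
    return (one_way(ref_vecs, hyp_vecs) + one_way(hyp_vecs, ref_vecs)) / 2

def vector_extrema(ref_vecs, hyp_vecs):
    # per dimension, keep the most extreme value (largest magnitude) across the words
    def extrema(vecs):
        vecs = np.asarray(vecs)
        idx = np.argmax(np.abs(vecs), axis=0)
        return vecs[idx, np.arange(vecs.shape[1])]
    return cosine(extrema(ref_vecs), extrema(hyp_vecs))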

The word-overlap-based metrics (BLEU, ROUGE, METEOR) and the word-vector-based metrics (Embedding Average Score, Greedy Matching Score, Vector Extrema Score) can all be computed by calling nlg-eval directly; for installing nlg-eval, refer to this blog post.

from nlgeval import NLGEval

references = ["This is a cat", "This is a feline"]
predictions = ["This is my cat"]

# ref_list must be a list of reference lists, each of the same length as hyp_list
references = [[r] for r in references]
nlgeval_ = NLGEval()
ans = nlgeval_.compute_metrics(hyp_list=predictions, ref_list=references)
print(ans)

''' {
	'Bleu_1': 0.7499999996250004, 
	'Bleu_2': 0.4999999997291671, 
	'Bleu_3': 4.999999996944452e-06, 
	'Bleu_4': 1.8803015450937985e-08, 
	'METEOR': 0.8559670781893004, 
	'ROUGE_L': 0.75, 
	'CIDEr': 0.0, 
	'SkipThoughtCS': 0.85018563, 
	'EmbeddingAverageCosineSimilarity': 0.810215, 
	'EmbeddingAverageCosineSimilairty': 0.810215, 
	'VectorExtremaCosineSimilarity': 0.77313, 
	'GreedyMatchingScore': 0.867042
 }'''

3. Language-Model-Based Metrics

Language-model-based metrics use a pre-trained language model to measure the similarity between the reference text and the generated text. They mainly include BERTScore, BARTScore, MoverScore, BLEURT, and Perplexity.

(1) BERTScore

Both the generated text and the reference text are tokenized into word pieces and encoded with BERT. The pairwise inner products of the two sets of contextual embeddings form a similarity matrix. From this matrix, the maximum similarity score of each token is accumulated (for the reference and for the generated text respectively) and normalized, which yields the BERTScore precision, recall, and F1.

Related paper: BERTScore: Evaluating Text Generation with BERT (ICLR 2020)
Code: https://github.com/Tiiiger/bert_score

from bert_score import score
def BertScore(predictions, references):
	P, R, F1 = score(predictions, references, lang="en", verbose=True, model_type = "distilbert-base-uncased")
	# print(f"System level F1 score: {F1.mean():.3f}")
	return P, R, F1

from bert_score import plot_example
def DrawBertScoreSimilarityMatrix(pred, ref):
	plot_example(pred, ref, lang="en", fname="/data/WWW/extra/metrics/bert_score.png")  

# predictions and references must have the same length
predictions = ['This is a cat', 'It is so lovely!']
references = ["What's the weather today?", 'So cute!']
print(BertScore(predictions, references))  # (tensor([0.6350, 0.8027]), tensor([0.6267, 0.8685]), tensor([0.6308, 0.8343]))
pred, ref = predictions[1], references[1]
DrawBertScoreSimilarityMatrix(pred, ref)

(Figure: BERTScore token-level similarity matrix for the example pair, produced by plot_example.)

Code example, version 2: using the Hugging Face evaluate library: https://huggingface.co/spaces/evaluate-metric/bertscore

'''
The original BERTScore paper showed that BERTScore correlates well with human judgment on sentence-level and system-level evaluation, but this depends on the model and language pair selected.
'''

import evaluate

bertscore = evaluate.load("bertscore")
predictions = ['This is a cat', 'It is so lovely!']
references = ["What's the weather today?", 'So cute!']

results = bertscore.compute(predictions=predictions, references=references, lang="en", model_type = "distilbert-base-uncased")
print(results)

'''{
'precision': [0.6349790096282959, 0.8026740550994873], 
'recall': [0.6267067193984985, 0.8684506416320801], 
'f1': [0.6308157444000244, 0.8342678546905518], 
'hashcode': 'distilbert-base-uncased_L5_no-idf_version=0.3.12(hug_trans=4.23.1)'
}'''

(2) BARTScore

Treats the evaluation of generated text itself as a text generation problem and uses an unsupervised approach to evaluate different aspects of the generated text (e.g. informativeness, fluency, or factuality).

Related paper: BARTScore: Evaluating Generated Text as Text Generation (NeurIPS 2021)
Code: https://github.com/neulab/BARTScore

import torch
import torch.nn as nn
import traceback
from transformers import BartTokenizer, BartForConditionalGeneration
from typing import List
import numpy as np


class BARTScorer:
    def __init__(self, device='cuda:0', max_length=1024, checkpoint='facebook/bart-large-cnn'):
        # Set up model
        self.device = device
        self.max_length = max_length
        self.tokenizer = BartTokenizer.from_pretrained(checkpoint)
        self.model = BartForConditionalGeneration.from_pretrained(checkpoint)
        self.model.eval()
        self.model.to(device)

        # Set up loss
        self.loss_fct = nn.NLLLoss(reduction='none', ignore_index=self.model.config.pad_token_id)
        self.lsm = nn.LogSoftmax(dim=1)

    def load(self, path=None):
        """ Load model from paraphrase finetuning """
        if path is None:
            path = 'models/bart.pth'
        self.model.load_state_dict(torch.load(path, map_location=self.device))

    def score(self, srcs, tgts, batch_size=4):
        """ Score a batch of examples """
        score_list = []
        for i in range(0, len(srcs), batch_size):
            src_list = srcs[i: i + batch_size]
            tgt_list = tgts[i: i + batch_size]
            try:
                with torch.no_grad():
                    encoded_src = self.tokenizer(
                        src_list,
                        max_length=self.max_length,
                        truncation=True,
                        padding=True,
                        return_tensors='pt'
                    )
                    encoded_tgt = self.tokenizer(
                        tgt_list,
                        max_length=self.max_length,
                        truncation=True,
                        padding=True,
                        return_tensors='pt'
                    )
                    src_tokens = encoded_src['input_ids'].to(self.device)
                    src_mask = encoded_src['attention_mask'].to(self.device)

                    tgt_tokens = encoded_tgt['input_ids'].to(self.device)
                    tgt_mask = encoded_tgt['attention_mask']
                    tgt_len = tgt_mask.sum(dim=1).to(self.device)

                    output = self.model(
                        input_ids=src_tokens,
                        attention_mask=src_mask,
                        labels=tgt_tokens
                    )
                    logits = output.logits.view(-1, self.model.config.vocab_size)
                    loss = self.loss_fct(self.lsm(logits), tgt_tokens.view(-1))
                    loss = loss.view(tgt_tokens.shape[0], -1)
                    loss = loss.sum(dim=1) / tgt_len
                    curr_score_list = [-x.item() for x in loss]
                    score_list += curr_score_list

            except RuntimeError:
                traceback.print_exc()
                print(f'source: {src_list}')
                print(f'target: {tgt_list}')
                exit(0)
        return score_list

    def multi_ref_score(self, srcs, tgts: List[List[str]], agg="mean", batch_size=4):
        # Assert we have the same number of references
        ref_nums = [len(x) for x in tgts]
        if len(set(ref_nums)) > 1:
            raise Exception("You have different number of references per test sample.")

        ref_num = len(tgts[0])
        score_matrix = []
        for i in range(ref_num):
            curr_tgts = [x[i] for x in tgts]
            scores = self.score(srcs, curr_tgts, batch_size)
            score_matrix.append(scores)
        if agg == "mean":
            score_list = np.mean(score_matrix, axis=0)
        elif agg == "max":
            score_list = np.max(score_matrix, axis=0)
        else:
            raise NotImplementedError
        return list(score_list)

    def test(self, batch_size=3):
        """ Test """
        src_list = [
            'This is a very good idea. Although simple, but very insightful.',
            'Can I take a look?',
            'Do not trust him, he is a liar.'
        ]

        tgt_list = [
            "That's stupid.",
            "What's the problem?",
            'He is trustworthy.'
        ]

        print(self.score(src_list, tgt_list, batch_size))
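
A usage sketch for the class above (the example sentences are illustrative and a CUDA device is assumed; higher, i.e. less negative, scores mean the target is more likely given the source):

bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')
print(bart_scorer.score(['This is interesting.'], ['This is fun.'], batch_size=4))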

(3) MoverScore

An improvement over BERTScore; see this article for an explanation of the underlying idea.

Related paper: MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance (EMNLP 2019)
Code: https://github.com/AIPHES/emnlp19-moverscore
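
A minimal sketch based on the moverscore_v2 module from the repository above (the exact API and defaults should be checked against the repo; the sentences are made up):

from moverscore_v2 import get_idf_dict, word_mover_score

references = ["This is a cat", "It is so lovely!"]
hypotheses = ["This is my cat", "So cute!"]

idf_dict_ref = get_idf_dict(references)
idf_dict_hyp = get_idf_dict(hypotheses)

scores = word_mover_score(references, hypotheses, idf_dict_ref, idf_dict_hyp,
                          stop_words=[], n_gram=1, remove_subwords=True)
print(scores)  # one score per (reference, hypothesis) pair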

(4) BLEURT

A metric obtained by fine-tuning a pre-trained model on semantic similarity data; see this article for an explanation of the underlying idea.

Related paper: BLEURT: Learning Robust Metrics for Text Generation (ACL 2020)
Code: https://github.com/google-research/bleurt

BLEURT can be computed with the Hugging Face evaluate library: https://huggingface.co/spaces/evaluate-metric/bleurt

import evaluate

predictions = ["hello there", "general kenobi"]
references = ["hello there", "general kenobi"]
bleurt = evaluate.load("bleurt", module_type="metric", checkpoint="bleurt-base-128")

results = bleurt.compute(predictions=predictions, references=references)
print(results)
# {'scores': [1.0295498371124268, 1.0445425510406494]}

(5) Perplexity

Perplexity measures the fluency of generated text. The lower the value, the better the language model fits the text, i.e. the closer the generated text is to natural language.
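
Concretely, for a token sequence $w_1, \dots, w_N$ scored by a language model, perplexity is the exponential of the average negative log-likelihood:

$$\mathrm{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\right)$$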

Related paper: Perplexity—a measure of the difficulty of speech recognition tasks

Perplexity can be computed with the Hugging Face evaluate library: https://huggingface.co/spaces/evaluate-metric/perplexity

import evaluate
import datasets
'''
Note that the output value is based heavily on what text the model was trained on. This means that perplexity scores are not comparable between models or datasets.
'''
perplexity = evaluate.load("perplexity", module_type="metric")

# Example 1: a few short strings
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
results = perplexity.compute(model_id='gpt2', add_start_token=False, predictions=input_texts)

# Example 2: the first 50 non-empty lines of the WikiText-2 test set
input_texts = datasets.load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:50]
input_texts = [s for s in input_texts if s != '']

results = perplexity.compute(model_id='gpt2', predictions=input_texts)
print(list(results.keys()))
# ['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
# 646.74

for i in results["perplexities"]:  # print the perplexity of each sentence
	print(round(i, 2))
	# 32.25
	# 1499.69
	# 408.28

4. Distance-Based Metrics

Distance-based metrics are a typical family of "error rate" measures; similar metrics include WER, PER, and others that differ only in how an "error" is defined. These metrics are mostly used for machine translation.

TER (Translation Edit Rate)

Computes the minimum number of edit operations needed to turn the generated text into the reference text, normalized by the reference length.

TER can be computed with the Hugging Face evaluate library: https://huggingface.co/spaces/evaluate-metric/ter

import evaluate

predictions = ["does this sentence match??", "what about this sentence?"]
references = [["does this sentence match", "does this sentence match!?!"],["wHaT aBoUt ThIs SeNtEnCe?", "wHaT aBoUt ThIs SeNtEnCe?"]]
ter = evaluate.load("ter")

results = ter.compute(predictions=predictions, references=references, normalized=True, case_sensitive=True)
print(results)
# {'score': 57.14285714285714, 'num_edits': 6, 'ref_length': 10.5}

5. Learning-Based Metrics

These approaches use machine learning / neural networks to learn an evaluation metric so that model scores correlate better with human judgments, e.g. various GAN-based metrics, ADEM, Dual Encoder, and so on.

6. Other Metrics

There are also metrics designed for specific text generation tasks, such as CIDEr and SPICE for image captioning, and task-specific metrics for data-to-text generation.

(1) CIDEr (Consensus-based Image Description Evaluation)

Treats each sentence as a "document" and represents it as a Term Frequency-Inverse Document Frequency (TF-IDF) vector. By computing a TF-IDF weight for each n-gram and taking the cosine similarity between the reference captions and the generated caption, it measures how consistent the generated caption is with the consensus of the references.

Related paper: CIDEr: Consensus-based image description evaluation

CIDEr can be computed directly with nlg-eval.

(2) SPICE (Semantic Propositional Image Caption Evaluation)

SPICE encodes the objects, attributes, and relationships in a caption with a graph-based semantic representation.
It first parses the candidate caption and the reference captions into syntactic dependency trees with a Probabilistic Context-Free Grammar (PCFG) dependency parser, then maps the dependency trees to scene graphs with a rule-based method, and finally computes an F-score over the objects, attributes, and relationships of the candidate caption.

Related paper: SPICE: Semantic Propositional Image Caption Evaluation

References

  1. Summary of objective evaluation metrics for text generation (with PyTorch implementations)
  2. Text Generation (13): a long-form overview of text generation evaluation metrics
  3. Evaluation metrics for dialogue systems
