LLM Text-Generation Evaluation Metrics

1. BLEU (precision-based metric)

  • Evaluates precision: analogous to precision — how many of the generated words appear among the reference's words.
  • Problem 1:
    • If the generation repeats a word that happens to appear in the reference, we get an inflated score.
    • The authors' fix: a word is counted at most as many times as it appears in the reference.
      • Example:
        • ref: "the cat is on the mat"
        • gen: "the the the the the the"
      • $P_{vanilla}=\frac{6}{6}$, $P_{mod}=\frac{2}{6}$
  • Fix for problem 1: clipping
    • This means an n-gram's count is capped at the number of times it occurs in the reference sentence.

$p_n=\frac{\sum_{geSnt \in C}\sum_{n\text{-}gram \in geSnt} Count_{clip}(n\text{-}gram)}{\sum_{geSnt \in C}\sum_{n\text{-}gram \in geSnt} Count(n\text{-}gram)}$
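The clipped-count idea can be sketched in a few lines of Python (a minimal illustration assuming whitespace tokenization; `ngrams` and `clipped_precision` are hypothetical helper names, not a library API):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(gen, ref, n=1):
    gen_counts = Counter(ngrams(gen.split(), n))
    ref_counts = Counter(ngrams(ref.split(), n))
    # each n-gram's count is clipped by its count in the reference
    clipped = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
    return clipped / sum(gen_counts.values())

# the repeated-word example from above: P_mod = 2/6
print(clipped_precision("the the the the the the", "the cat is on the mat"))
```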

  • Problem 2:
    • Because the evaluation is precision-based, it obviously favors shorter generations and underrates longer ones.
  • Fix for problem 2: brevity penalty
    • $BR = \min(1, e^{1 - \frac{l_{ref}}{l_{gen}}})$: if the generation is longer than the reference, $BR = 1$; if shorter, $BR \in (0, 1)$.
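The brevity penalty is a one-liner; a quick sketch (the function name is illustrative):

```python
import math

def brevity_penalty(gen_len, ref_len):
    # BR = min(1, e^{1 - l_ref / l_gen})
    return min(1.0, math.exp(1 - ref_len / gen_len))

print(brevity_penalty(8, 6))            # generation longer than reference -> 1.0
print(round(brevity_penalty(4, 6), 3))  # shorter generation is penalized: e^{-0.5} ≈ 0.607
```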

Final formula:

$BLEU\text{-}N = BR \cdot \left(\prod_{n=1}^{N} p_n\right)^{1/N}$

Example: computing BLEU-4

  • ref-“the cat sat on the mat”
  • g-“the cat the cat is on the mat”
  • BR: $BR = \min(1, e^{1-6/8}) = 1$
  • n=1
    • 1-gram: org:{"the", "cat", "sat", "on", "mat"} ge:{"the", "cat", "is", "on", "mat"}
    • clip: $count_{clip}(\text{"the"}) = 2$, $count_{clip}(\text{"cat"}) = 1$, $count_{clip}(\text{"is"}) = 0$ for the 1-grams in the generation
    • $p_1 = \frac{5}{8}$
  • n=2
    • 2-gram: org:{"the cat", "cat sat", "sat on", "on the", "the mat"} ge:{"the cat", "cat the", "cat is", "is on", "on the", "the mat"}
    • $p_2 = \frac{3}{7}$
  • n=3
    • 3-gram: org:{"the cat sat", "cat sat on", "sat on the", "on the mat"} ge:{"the cat the", "cat the cat", "the cat is", "cat is on", "is on the", "on the mat"}
    • $p_3 = \frac{1}{6}$
  • n=4
    • 4-gram: org:{"the cat sat on", "cat sat on the", "sat on the mat"} ge:{"the cat the cat", "cat the cat is", "the cat is on", "cat is on the", "is on the mat"}
    • $p_4 = \frac{0}{5}$
  • BLEU-4: $1 \cdot (\frac{5}{8} \cdot \frac{3}{7} \cdot \frac{1}{6} \cdot \frac{0}{5})^{1/4} = 0$
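The whole worked example can be reproduced end to end with a small sketch (single-sentence case, whitespace tokenization assumed; `bleu` is an illustrative function, not sacrebleu's implementation):

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(gen, ref, N=4):
    gen_t, ref_t = gen.split(), ref.split()
    ps = []
    for n in range(1, N + 1):
        g = Counter(ngrams(gen_t, n))
        r = Counter(ngrams(ref_t, n))
        clipped = sum(min(c, r[k]) for k, c in g.items())
        ps.append(clipped / sum(g.values()))
    br = min(1.0, math.exp(1 - len(ref_t) / len(gen_t)))
    # the geometric mean is zero as soon as any p_n is zero
    return br * math.prod(ps) ** (1 / N), ps

score, ps = bleu("the cat the cat is on the mat", "the cat sat on the mat")
print(ps)     # [5/8, 3/7, 1/6, 0/5]
print(score)  # 0.0
```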

1.1 Hugging Face load_metric calling sacrebleu

We can see that the package's computation matches the calculation above:

!pip install sacrebleu
from datasets import load_metric

bleu_metric = load_metric("sacrebleu")
bleu_metric.add(prediction="the cat the cat is on the mat",
                reference=["the cat sat on the mat"])
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results
"""
{'score': 0.0,
 'counts': [5, 3, 1, 0],
 'totals': [8, 7, 6, 5],
 'precisions': [62.5, 42.857142857142854, 16.666666666666668, 0.0],
 'bp': 1.0,
 'sys_len': 8,
 'ref_len': 6}
"""

2. ROUGE (recall-based metric)

  • Evaluates recall: analogous to recall — how many of the reference's words appear among the generated words.

$ROUGE = \frac{\sum_{orgSnt \in C}\sum_{n\text{-}gram \in orgSnt} Count_{match}(n\text{-}gram)}{\sum_{orgSnt \in C}\sum_{n\text{-}gram \in orgSnt} Count(n\text{-}gram)}$
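ROUGE-N recall mirrors the clipped-precision sketch above, just counted from the reference side (whitespace tokenization assumed; `rouge_n_recall` is an illustrative name):

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(gen, ref, n=1):
    gen_counts = Counter(ngrams(gen.split(), n))
    ref_counts = Counter(ngrams(ref.split(), n))
    # matches are counted over the reference n-grams (recall)
    matched = sum(min(c, gen_counts[g]) for g, c in ref_counts.items())
    return matched / sum(ref_counts.values())

print(rouge_n_recall("the cat the cat is on the mat",
                     "the cat sat on the mat"))  # 5/6 ≈ 0.833
```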

  • There is a separate score, ROUGE-L, based on the longest common subsequence (LCS)

$R_{LCS}=\frac{LCS(X,Y)}{m}; \quad P_{LCS}=\frac{LCS(X,Y)}{n}$
$F_{LCS}=\frac{(1+\beta^2) R_{LCS} P_{LCS}}{R_{LCS}+\beta^2 P_{LCS}}, \quad \beta=\frac{P_{LCS}}{R_{LCS}}$
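The LCS length can be computed with the standard dynamic program; a sketch using the running example (X is the reference with $m=6$ tokens, Y the generation with $n=8$):

```python
def lcs_len(x, y):
    # classic DP table for the longest common subsequence length
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xi == yj else max(dp[i-1][j], dp[i][j-1])
    return dp[len(x)][len(y)]

ref = "the cat sat on the mat".split()           # m = 6
gen = "the cat the cat is on the mat".split()    # n = 8
l = lcs_len(ref, gen)                # LCS = "the cat on the mat" -> 5
print(l / len(ref), l / len(gen))    # R_LCS = 5/6, P_LCS = 5/8
```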

Example: computing ROUGE-1

  • ref-“the cat sat on the mat”
  • g-“the cat the cat is on the mat”
  • 1-gram: org:{“the”, “cat”, “sat”, “on”, “mat”} ge:{“the”, “cat”, “is”, “on”, “mat”}
  • $ROUGE\text{-}1^{r}=\frac{2+1+0+1+1}{6}=\frac{5}{6}$
  • $ROUGE\text{-}1^{p}=\frac{\min(3,2)+\min(2,1)+0+1+1}{8}=\frac{5}{8}$ (the same value as BLEU's clipped 1-gram precision)
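The fmeasure reported by the Hugging Face rouge metric is the plain harmonic mean of precision and recall; plugging in the ROUGE-1 values from this example:

```python
# ROUGE-1 precision and recall from the worked example
p, r = 5/8, 5/6
f1 = 2 * p * r / (p + r)   # harmonic mean (beta = 1)
print(f1)                  # 5/7 ≈ 0.714285...
```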

2.1 Hugging Face load_metric calling rouge

We can see that the package's computation matches the calculation above:

!pip install rouge_score
from datasets import load_metric

rouge_metric = load_metric("rouge")
rouge_metric.add(prediction="the cat the cat is on the mat",
                 reference=["the cat sat on the mat"])
results = rouge_metric.compute()
# F-measure as the harmonic mean of precision and recall:
print(1 / (0.5 * 1/0.625 + 0.5 * 1/0.8333333333333334))
results
"""
0.7142857142857143
{'rouge1': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143)),
 'rouge2': AggregateScore(low=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5), mid=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5), high=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5)),
 'rougeL': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143)),
 'rougeLsum': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143))}

""" 
