BLEU is a precision-based metric. $Count_{clip}$: the number of times an n-gram is counted is capped by the number of times it appears in the reference sentence:

$$p_n=\frac{\sum_{geSnt \in C}\sum_{n\text{-gram} \in geSnt} Count_{clip}(n\text{-gram})}{\sum_{geSnt \in C}\sum_{n\text{-gram} \in geSnt} Count(n\text{-gram})}$$
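A minimal single-pair sketch of the clipped count and $p_n$ (the helper names ngrams and modified_precision are hypothetical; the real metric sums numerator and denominator over the whole corpus C and over all references):

from collections import Counter

def ngrams(tokens, n):
    # all n-grams of a token list, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # clipped count and total: each candidate n-gram is counted at most
    # as many times as it appears in the reference
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped, total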
BR is the brevity penalty: it is 1 when the generated text is longer than the reference and $e^{1-r/c}$ otherwise (c = generated length, r = reference length), so overly short outputs are penalized. Final formula:

$$BLEU\text{-}N = BR \cdot \left(\prod_{n=1}^{N} p_n\right)^{1/N}$$
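Continuing the sketch under the same simplifying assumptions (single reference, no smoothing):

import math

def bleu_n(candidate, reference, max_n=4):
    # BLEU-N = BR * (prod p_n)^(1/N) for one candidate/reference pair
    precisions = []
    for n in range(1, max_n + 1):
        clipped, total = modified_precision(candidate, reference, n)
        precisions.append(clipped / total if total else 0.0)
    c, r = len(candidate), len(reference)
    br = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    if min(precisions) == 0.0:
        return 0.0   # without smoothing, a single zero p_n zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    return br * geo_mean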
Example: computing BLEU-4
{"the", "cat", "sat", "on", "mat"}
ge:{"the", "cat", "is", "on", "mat"}
{"the cat", "cat sat", "sat on", "on the", "the mat"}
ge:{"the cat", "cat the", "cat is", "is on", "on the", "the mat"}
{"the cat sat", "cat sat on", "sat on the", "on the mat"}
ge:{"the cat the", "cat the cat", "the cat is", "cat is on", "is on the", "on the mat"}
{"the cat sat on", "cat sat on the", "sat on the mat"}
ge:{"the cat the cat", "cat the cat is", "the cat is on", "cat is on the", "is on the mat"}
Using load_metric from the datasets package, which calls sacrebleu under the hood; you can see that the package computes the same thing as above.
!pip install sacrebleu
from datasets import load_metric

# load the sacrebleu wrapper
bleu_metric = load_metric("sacrebleu")
# one prediction against a list of references
bleu_metric.add(prediction="the cat the cat is on the mat", reference=["the cat sat on the mat"])
# smooth_value=0 means no smoothing, so a zero n-gram precision zeroes the whole score
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results
"""
{'score': 0.0,
'counts': [5, 3, 1, 0],
'totals': [8, 7, 6, 5],
'precisions': [62.5, 42.857142857142854, 16.666666666666668, 0.0],
'bp': 1.0,
'sys_len': 8,
'ref_len': 6}
"""
ROUGE is a recall-based metric:

$$ROUGE = \frac{\sum_{orgSnt \in C}\sum_{n\text{-gram} \in orgSnt} Count_{match}(n\text{-gram})}{\sum_{orgSnt \in C}\sum_{n\text{-gram} \in orgSnt} Count(n\text{-gram})}$$
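A single-pair sketch of this recall, reusing the hypothetical ngrams helper from the BLEU sketch (the real metric sums matches and totals over all reference sentences in C):

from collections import Counter

def rouge_n_recall(candidate, reference, n):
    # matched reference n-grams (clipped by candidate counts) over total reference n-grams
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    matched = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return matched / total if total else 0.0

# for the example pair below: n=1 gives 5/6 ≈ 0.833, n=2 gives 3/5 = 0.6,
# matching the recall values in the rouge output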
There is also a separate score, ROUGE-L, based on the longest common subsequence (LCS). With X the reference of length m and Y the generated sentence of length n:

$$R_{LCS}=\frac{LCS(X,Y)}{m}, \quad P_{LCS}=\frac{LCS(X,Y)}{n}$$

$$F_{LCS}=\frac{(1+\beta^2)R_{LCS}P_{LCS}}{R_{LCS}+\beta^2 P_{LCS}}, \quad \beta=\frac{P_{LCS}}{R_{LCS}}$$
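A sketch of ROUGE-L with a dynamic-programming LCS. The definition above sets β = P_LCS/R_LCS, while the rouge package used below effectively reports the plain F1 (β = 1, i.e. 2PR/(P+R)), so the sketch exposes β as a parameter:

def lcs_len(x, y):
    # length of the longest common subsequence of two token lists
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def rouge_l(candidate, reference, beta=1.0):
    # R_LCS, P_LCS, F_LCS for one pair (X = reference, length m; Y = candidate, length n)
    lcs = lcs_len(reference, candidate)
    r_lcs = lcs / len(reference)
    p_lcs = lcs / len(candidate)
    denom = r_lcs + beta ** 2 * p_lcs
    f_lcs = (1 + beta ** 2) * r_lcs * p_lcs / denom if denom else 0.0
    return r_lcs, p_lcs, f_lcs

# for the example pair: LCS = 5 ("the cat ... on the mat"),
# giving (0.833..., 0.625, 0.714...) with beta=1, matching rougeL below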
Example: computing ROUGE-1
Using load_metric again, this time backed by the rouge_score package; the package's computation matches the calculation above.
!pip install rouge_score
from datasets import load_metric

# load the rouge wrapper (backed by Google's rouge_score package)
rouge_metric = load_metric("rouge")
rouge_metric.add(prediction="the cat the cat is on the mat", reference=["the cat sat on the mat"])
results = rouge_metric.compute()
# sanity check: the harmonic mean of precision 0.625 and recall 0.8333... equals the reported fmeasure
print(1/(0.5* 1/0.625 + 0.5* 1/0.8333333333333334))
results
"""
0.7142857142857143
{'rouge1': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143)),
'rouge2': AggregateScore(low=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5), mid=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5), high=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5)),
'rougeL': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143)),
'rougeLsum': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143))}
"""