A Study of Translation Edit Rate with Targeted Human Annotation
Matthew Snover and Bonnie Dorr
Institute for Advanced Computer Studies
University of Maryland
College Park, MD 20742
{snover,bonnie}@umiacs.umd.edu
Summary of key points from the paper:
1、Translation Edit Rate (TER) measures the amount of editing that a human would have to perform to change a system output so it exactly matches a reference translation.
2、Automatic evaluation metrics for machine translation include BLEU, METEOR, NIST, and TER, among others.
3、We define a new, more intuitive measure of “goodness” of MT output—specifically, the number of edits needed to fix the output so that it semantically matches a correct translation.
4、Recently, the GALE (Global Autonomous Language Exploitation) research program (Olive, 2005) introduced a new error measure called Translation Edit Rate (TER), originally designed to count the number of edits (including phrasal shifts) a human performs to change a hypothesis so that it is both fluent and has the correct meaning. This was then decomposed into two steps: defining a new reference, and finding the minimum number of edits so that the hypothesis exactly matches one of the references. The measure was defined so that all edits, including shifts, have a cost of one. Finding only the minimum number of edits, without generating a new reference, is the measure defined as TER; finding the minimum number of edits to a new targeted reference is defined as human-targeted TER (HTER).
5、BLEU (Papineni et al., 2002) calculates the score of a translation by measuring the number of n-grams, of varying length, of the system output that occur within the set of references.
6、METEOR (Banerjee and Lavie, 2005) is an evaluation measure that counts the number of exact word matches between the system output and the reference. Unmatched words are then stemmed and matched again. Additional penalties are assessed for reordering the words between the hypothesis and the reference. This method has been shown to correlate very well with human judgments.
7、TER is defined as the minimum number of edits needed to change a hypothesis so that it exactly matches one of the references, normalized by the average length of the references.
8、Possible edits include the insertion, deletion, and substitution of single words as well as shifts of word sequences.
10、The number of insertions, deletions, and substitutions is calculated using dynamic programming. A greedy search is used to find the set of shifts, by repeatedly selecting the shift that most reduces the number of insertions, deletions and substitutions, until no more beneficial shifts remain.
12、In both TER and HTER, the majority of the edits were substitutions and deletions.
13、In an analysis of shift size and distance, we found that most shifts are short (1 word long) and move the shifted words by fewer than 7 words.
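
The TER definition in points 7, 8 and 10 can be illustrated with a small Python sketch. It computes insertions, deletions and substitutions by dynamic programming, greedily applies the single shift that most reduces that count until no shift helps, charges one edit per shift, and divides the smallest total over all references by the average reference length. This is only a brute-force illustration under stated assumptions: the function names (edit_distance, best_shift, ter) are hypothetical, every contiguous word block is tried at every position, and none of the matching constraints or search heuristics of the authors' implementation are reproduced, so scores may differ from those of the official tool.

```python
from typing import List, Sequence, Tuple


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Word-level insertions + deletions + substitutions (dynamic programming)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h
                            curr[j - 1] + 1,      # insert r
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def best_shift(hyp: List[str], ref: Sequence[str]) -> Tuple[int, List[str]]:
    """Return (distance, hypothesis) after the single most helpful shift.

    Assumption: any contiguous block may move anywhere (brute force).
    If no shift lowers the distance, the hypothesis comes back unchanged.
    """
    best_dist, best_hyp = edit_distance(hyp, ref), hyp
    n = len(hyp)
    for start in range(n):
        for end in range(start + 1, n + 1):
            block = hyp[start:end]
            rest = hyp[:start] + hyp[end:]
            for pos in range(len(rest) + 1):
                if pos == start:
                    continue  # block would land where it started
                candidate = rest[:pos] + block + rest[pos:]
                dist = edit_distance(candidate, ref)
                if dist < best_dist:
                    best_dist, best_hyp = dist, candidate
    return best_dist, best_hyp


def edits_with_shifts(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Greedy shift search; every shift costs one edit, like any other edit."""
    current_hyp = list(hyp)
    current = edit_distance(current_hyp, ref)
    shifts = 0
    while True:
        dist, candidate = best_shift(current_hyp, ref)
        if dist >= current:  # no remaining shift reduces the edit count
            break
        current_hyp, current, shifts = candidate, dist, shifts + 1
    return shifts + current


def ter(hyp: Sequence[str], refs: Sequence[Sequence[str]]) -> float:
    """Minimum edits over all references / average reference length."""
    if not refs or not any(refs):
        raise ValueError("need at least one non-empty reference")
    avg_len = sum(len(r) for r in refs) / len(refs)
    return min(edits_with_shifts(hyp, r) for r in refs) / avg_len


if __name__ == "__main__":
    # One shift of "a b" to the front fixes the hypothesis: 1 edit / 5 words.
    print(ter("c d e a b".split(), ["a b c d e".split()]))  # 0.2
```

In the usage example, plain edit distance would need four word edits, whereas a single phrasal shift suffices, which is why shifts are counted as a separate edit type. HTER would use the same computation, with a human-created targeted reference supplied in place of the original references.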