Abstract
Following the recent adoption by the machine translation community of automatic evaluation using the BLEU/NIST scoring process, we conduct an in-depth study of a similar idea for evaluating summaries. The results show that automatic evaluation using unigram co-occurrences between summary pairs correlates surprisingly well with human evaluations, as measured by various statistical metrics, whereas direct application of the BLEU evaluation procedure does not always give good results.
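To illustrate the kind of score the abstract refers to, the sketch below computes a recall-oriented unigram co-occurrence score between a candidate summary and a set of reference summaries. This is a minimal sketch, not the paper's exact procedure: whitespace tokenization, lowercasing, clipping of matched counts, and the function name `unigram_cooccurrence` are assumptions made for illustration.

```python
from collections import Counter


def unigram_cooccurrence(candidate: str, references: list[str]) -> float:
    """Recall-oriented unigram co-occurrence between a candidate summary
    and one or more reference summaries.

    Each reference unigram counts as matched at most as many times as it
    appears in the candidate (clipped counts); the score is the total
    number of matched unigrams divided by the total number of unigrams
    in the references.
    """
    cand_counts = Counter(candidate.lower().split())
    total_matches = 0
    total_ref_unigrams = 0
    for ref in references:
        ref_counts = Counter(ref.lower().split())
        # Matched unigrams, clipped by the candidate's own counts.
        total_matches += sum(
            min(cnt, cand_counts[tok]) for tok, cnt in ref_counts.items()
        )
        total_ref_unigrams += sum(ref_counts.values())
    return total_matches / total_ref_unigrams if total_ref_unigrams else 0.0


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    cand = "the cat was on the mat"
    print(f"unigram co-occurrence score: {unigram_cooccurrence(cand, refs):.3f}")
```

With the toy example above, five of the six reference unigrams are matched, giving a score of about 0.833; averaging such scores over many summary pairs yields the system-level numbers that the paper compares against human judgments.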
Publication Info
- Year: 2003
- Type: article
- Volume: 1
- Pages: 71-78
Identifiers
- DOI: 10.3115/1073445.1073465