Abstract

We report on research in multi-document summarization and on the evaluation of summarization in the framework of cross-lingual information retrieval. This work was carried out during a summer workshop on Language Engineering held at Johns Hopkins University by a team of nine researchers from seven universities. The goals of the research were as follows: (1) to develop a toolkit for the evaluation of single-document and multi-document summarizers, (2) to develop a modular multi-document summarizer, called MEAD, that works in both English and Chinese, and (3) to perform a meta-evaluation of four automatic summarizers, including MEAD, using several types of evaluation measures: some currently used by summarization researchers and some novel techniques. Central to the experiments in this workshop was the cross-lingual experimental setup based on a large-scale Chinese-English parallel corpus. An extensive set of human judgments was prepared specifically for our research by the Linguistic Data Consortium. These human judgments include (a) which documents are relevant to a certain query and (b) which sentences in the relevant documents are most relevant to the query and therefore constitute a good summary of the cluster. These judgments were used to construct variable-length multi-document and single-document model summaries. Since one of the novel evaluation metrics that we used, Relevance Correlation, is based on the premise that good summaries preserve query relevance both within a language and across languages, we made use of a cross-lingual Information Retrieval (IR) engine. We evaluated the quality of the automatic summaries using co-selection and content-based evaluation, two established techniques. A relatively new metric, relative utility, was also extensively tested. Part of the new scientific contribution is the measurement of relevance correlation, which we introduced and systematically examined in this workshop. Relevance correlation measures the quality of summaries, in comparison to the entire documents, as a function of how much document relevance drops when summaries are indexed instead of full documents. Our results show that this measure is sensible, in that it correlates with more established evaluation measures. Another contribution is the cross-lingual setup, which allows us to automatically translate English queries into Chinese and to perform Chinese IR with or without summarization. This allows us to calculate relevance correlation for English and for Chinese in parallel (i.e., for the same queries) and to make direct cross-lingual comparisons of evaluations. Additionally, an alternative way of constructing Chinese model summaries from English ones was implemented, which relies on the sentence alignment of the English and Chinese documents. The results of our large-scale meta-evaluation are numerous; some of the highlights are the following: (1) all evaluation measures rank human summaries first, which is an appropriate and expected property of such measures; (2) both relevance correlation and the content-based measures place leading-sentence extracts ahead of the more sophisticated summarizers; (3) relative utility ranks our system, MEAD, as the best summarizer for shorter summaries, although for longer summaries lead-based summaries outperform MEAD; (4) co-selection measurements show overall low agreement among humans (above chance), whereas relative utility reports higher numbers on the same data (but does not normalize for chance).
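
As a rough illustration of the Relevance Correlation idea described above, the sketch below correlates, for each query, the relevance scores an IR engine assigns to full documents with the scores it assigns to the corresponding summaries, and averages the per-query correlations. The choice of Pearson correlation, the function names, and the toy scores are assumptions made for illustration only; the workshop's actual computation was carried out over the SMART relevance outputs listed among the deliverables.

from statistics import mean


def pearson(x, y):
    # Plain Pearson correlation between two equal-length score vectors.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x) ** 0.5
    ss_y = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (ss_x * ss_y) if ss_x and ss_y else 0.0


def relevance_correlation(full_doc_scores, summary_scores):
    # Average per-query correlation between the relevance scores obtained by
    # indexing full documents and those obtained by indexing their summaries.
    # Higher values mean the summaries preserve query relevance better.
    per_query = [pearson(full_doc_scores[q], summary_scores[q])
                 for q in full_doc_scores]
    return mean(per_query)


# Hypothetical toy scores for two queries over the same three documents.
full = {"q1": [2.1, 0.4, 1.3], "q2": [0.2, 1.9, 0.8]}
summ = {"q1": [1.8, 0.3, 1.1], "q2": [0.1, 1.7, 0.9]}
print(relevance_correlation(full, summ))  # close to 1.0: relevance is preserved
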
The deliverable resources and software include: (1) a turn-key extractive multi-document summarizer, MEAD, which allows users to add their own features based on single sentences or pairs of sentences, (2) a large corpus of summaries produced by several automatic methods, including baseline and random summaries, (3) a collection of manual summaries produced by the Linguistic Data Consortium (LDC), (4) a battery of evaluation routines, (5) a collection of IR queries in English and Chinese and the corresponding relevance judgments from the Hong Kong News collection, (6) SMART relevance outputs for both full documents and summaries, and (7) XML tools for processing documents and summaries.
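
As a minimal sketch of how an extractive, feature-based summarizer of the kind delivered above might combine per-sentence features, the example below scores sentences with a linear mix of simple centroid, position, and length features and keeps the top-ranked ones. The feature set, weights, and scoring formula are illustrative assumptions, not MEAD's actual implementation.

# Illustrative extractive sentence scorer in the spirit of MEAD's plug-in
# features (feature names, weights, and formula are assumptions, not MEAD code).
def score_sentences(sentences, weights):
    # Toy "centroid" feature: overlap with the most frequent words in the cluster.
    all_words = [w.lower() for s in sentences for w in s.split()]
    freq = {w: all_words.count(w) for w in set(all_words)}
    top_words = set(sorted(freq, key=freq.get, reverse=True)[:10])

    scored = []
    for i, s in enumerate(sentences):
        words = set(w.lower() for w in s.split())
        features = {
            "centroid": len(words & top_words) / (len(words) or 1),
            "position": 1.0 / (i + 1),                   # earlier sentences score higher
            "length": min(len(s.split()) / 20.0, 1.0),   # dampen very short sentences
        }
        score = sum(weights[name] * value for name, value in features.items())
        scored.append((score, s))
    return sorted(scored, reverse=True)


# Usage: keep the top-n sentences as the extract.
cluster_sentences = [
    "The workshop produced a multi-document summarizer.",
    "Summaries were evaluated with several measures.",
    "Weather in Baltimore was hot that summer.",
]
ranking = score_sentences(cluster_sentences,
                          {"centroid": 1.0, "position": 1.0, "length": 0.5})
extract = [s for _, s in ranking[:2]]
print(extract)
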

Keywords

Automatic summarization, multi-document summarization, information retrieval, cross-language information retrieval, relevance, query expansion, natural language processing, linguistics, artificial intelligence, computer science

Publication Info

Year: 2011
Type: Article
Citations: 27 (OpenAlex)
Access: Closed

Cite This

Dragomir Radev, Simone Teufel, Horacio Saggion et al. (2011). Evaluation of Text Summarization in a Cross-lingual Information Retrieval Framework.