Classification of Sentence Ranking Methods for Multi-Document Summarization

Sean Sovine (Marshall University, USA) and Hyoil Han (Marshall University, USA)
DOI: 10.4018/978-1-4666-5019-0.ch001

Abstract

Modern information technology allows text information to be produced and disseminated at a very rapid pace. This situation leads to the problem of information overload, in which users are faced with a very large body of text that is relevant to an information need but have no efficient and effective way to locate the specific information they need within it. In one example of such a scenario, a user might be given a collection of digital news articles relevant to a particular current event and may need to rapidly generate a summary of the essential information relevant to the event contained in those articles. In extractive multi-document summarization (MDS), the most fundamental task is to select a subset of the sentences in the input document set in order to form a summary of the document set. An essential component of this task is sentence ranking, in which sentences from the original document set are ranked in order of importance for inclusion in a summary. The purpose of this chapter is to give an analysis of the most successful methods for sentence ranking that have been employed in recent MDS work. To this end, the authors classify sentence ranking methods into six classes and present and discuss specific approaches within each class.

Introduction

Automatic text summarization is one attempt to solve the problem of information overload. It consists of the study of automated techniques for extracting key information from a body of text and using that information to form a concise summary of that text. The ideal of automatic summarization work is to develop techniques by which a machine can generate summaries that successfully imitate summaries generated by human beings. The category of automatic summarization contains a wide range of variations of the basic summarization task. This variety arises because of the many different purposes that exist for generating a summary, the different possible definitions of what a text summary is, and the great variety that exists in possible input data sources for a summarization algorithm.

Automatic summarization tasks can be divided along several axes. First, the input data that is to be summarized may be known to belong to a specific domain, or the task may be considered generic or predominantly domain-independent. The input data may be from a single source document, or from multiple source documents. In the case that the input data consists of a set of documents from different sources with a common topic, the task is referred to as multi-document summarization (MDS). The summarization task may be informative, so that the summarization algorithm attempts to determine the key information in the input data using features of the input data set. On the other hand, the task may be focused summarization, in which the consumer of the summary has a particular question or specific topic that will be used to guide and motivate the summarization process. Many summarization systems are currently designed to incorporate aspects of both informative and focused summarization. Finally, summarization may be abstractive or extractive (Nenkova & McKeown, 2011; Radev, Hovy, & McKeown, 2002; Sparck Jones, 1999).

Abstractive summaries are like those typically created by human summarizers, where the summary is composed of language that is generated specifically for the purpose of the summary. Extractive summaries, by contrast, are composed of sentences or parts of sentences that are extracted from the text of the input documents—and possibly rearranged or compressed—to form the final summary, with few other modifications (Nenkova & McKeown, 2011; Radev, Hovy, & McKeown, 2002). This chapter addresses extractive MDS systems. Currently, most summarization systems developed and tested for research purposes are extractive in nature.
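As a rough illustration of the extractive approach described above, the following Python sketch ranks every sentence in a small document set and keeps the top-ranked ones as the summary. The scoring rule (average relative word frequency across the whole document set) is an assumption chosen only for brevity and is not a method attributed to the chapter; the function names split_sentences, tokenize, rank_sentences, and extract_summary are hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphanumeric runs as words."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def rank_sentences(documents):
    """Return (score, sentence) pairs, most important first.

    Illustrative scoring only: a sentence's score is the mean relative
    frequency of its words, counted over all input documents.
    """
    sentences = [s for doc in documents for s in split_sentences(doc)]
    freq = Counter(w for s in sentences for w in tokenize(s))
    total = sum(freq.values())
    scored = []
    for s in sentences:
        words = tokenize(s)
        if not words:
            continue
        score = sum(freq[w] / total for w in words) / len(words)
        scored.append((score, s))
    # The sort is stable, so tied sentences keep their document order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def extract_summary(documents, max_sentences=2):
    """Form an extractive summary from the highest-ranked sentences."""
    return [s for _, s in rank_sentences(documents)[:max_sentences]]


if __name__ == "__main__":
    docs = [
        "The storm hit the coast. Many homes lost power.",
        "The storm damaged the coast. Officials opened shelters.",
    ]
    for sentence in extract_summary(docs):
        print(sentence)
```

With these two toy documents, the sketch selects two near-duplicate sentences about the storm. That outcome hints at why practical MDS systems usually add stop-word handling and redundancy control on top of the ranking step.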

Most current summarization research is focused on a generic multi-document summarization task that also features a query-focused component. This is largely due to conventions developed during the course of the Document Understanding Conferences (DUC) and Text Analysis Conferences (TAC) (NIST, 2011; NIST, 2013). The evaluation tasks developed for DUC and TAC are by far the most widely used methodologies for evaluating automatic summarization systems. We discuss DUC and TAC and their evaluation methodologies further in the section Evaluation Methodologies.

Systems developed for the DUC/TAC evaluation task are largely domain-independent, but are designed to be tested using corpora of newswire documents containing multiple topic-focused document sets. These systems are often intended to generate summaries that are both informative and focused, but some of these systems are either exclusively informative or exclusively focused in approach. Some experimental evidence suggests that, while current MDS systems are achieving continually higher levels of quality in the summaries they generate, the performance of these current systems has not yet reached the theoretical maximum of the extractive approach, which is the extractive summary containing the optimal set of document sentences (Bysani, Reddy, & Varma, 2009).
