A Keyphrase-Based Approach to Text Summarization for English and Bengali Documents

Kamal Sarkar (Computer Science and Engineering Department, Jadavpur University, Kolkata, India)
Copyright: © 2014 | Pages: 11
DOI: 10.4018/ijtd.2014040103

Abstract

With the rapid growth of the World Wide Web, information overload is becoming a problem for an increasingly large number of people. Since summarization helps humans digest the main contents of a text document very rapidly, there is a need for an effective and powerful tool that can automatically summarize text. In this paper, we present a keyphrase-based approach to single document summarization that first extracts a set of keyphrases from a document, uses the extracted keyphrases to choose sentences from the document, and finally forms an extractive summary from the chosen sentences. We view keyphrases (single or multi-word) as the important concepts, and we assume that an extractive summary of a document is an elaboration of the important concepts contained in the document to some permissible extent, controlled by the given summary length. We have tested our proposed keyphrase-based summarization approach on two different datasets: one for English and another for Bengali. The experimental results show that the performance of the proposed system is comparable to that of some state-of-the-art summarization systems.

Introduction

The World Wide Web (WWW) has evolved as an information hub on the Internet. It manages the enormous flood of information and presents it to Internet users (Ahmed & Mahmood, 2012). Text summarization tools can be an indispensable solution to the information overload problem because they can generate a quick overview of an entire document and can support document indexing, question-answering systems, retrieval systems, etc.

Depending on the size of the input to the summarization process, automatic text summarization can be of two types: single document summarization and multi-document summarization. When only one document is the input, it is called single document summarization, and when the input is a set of related text documents, it is called multi-document summarization. A summary is produced in the form of an abstract or an extract. An extract is a summary consisting of a number of important textual units selected from the input(s) (Mani, 2001). An abstract is a summary that represents the subject matter of the input(s) with text units generated by reformulating the salient units selected from the input(s).

Multilingual summarization is defined by Mani (2001) as “processing several languages, with summary in the same language as input.” Since the summarization approach presented in this paper can process both English and Bengali documents and produce a summary in the same language as the input, it can be regarded as a multilingual summarization approach.

The earliest works on text summarization use features such as sentence position, word importance, cue phrases, title information and sentence length for ranking sentences (Baxendale, 1958; Luhn, 1958; Edmundson, 1969; Sarkar, 2012a; Sarkar, 2012b). The approach presented in Hewahi and Abu Kwaik (2012) uses multi-word terms such as entity objects (names, places) and specialized terminologies, along with some semantic features specific to the Arabic language, for Arabic text summarization.

The centroid-based summarization system (Radev, Jing, Styś, & Tam, 2004) assigns weights to sentences based on their similarity to the centroid, where the centroid is represented by a set of single-word terms whose weights are greater than a predefined threshold. The centroid-based approach differs from our approach because it ignores the use of multi-word keyphrases for text summarization, and it also does not use keyphrases for maximizing diverse but relevant information in the summary.
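
The centroid idea described above can be illustrated with a minimal sketch. It is not the system of Radev et al. (2004): the term weight used here (raw term frequency after removing a tiny stopword list), the overlap-based similarity, and the names `tokenize`, `STOPWORDS` and `centroid_scores` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from any paper).
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def centroid_scores(sentences, threshold=1):
    """Score each sentence by the summed centroid weight of its distinct words.

    The centroid keeps only single-word terms whose weight exceeds
    `threshold`; the weight is plain term frequency in this sketch.
    """
    tokenized = [tokenize(s) for s in sentences]
    weights = Counter(w for tokens in tokenized for w in tokens)
    centroid = {w: wt for w, wt in weights.items() if wt > threshold}
    return [sum(centroid.get(w, 0) for w in set(tokens)) for tokens in tokenized]

scores = centroid_scores(["Cats chase mice.", "Cats sleep.", "Dogs bark loudly."])
```

Because every centroid term is a single word, a multi-word concept such as "text summarization" contributes only through its separate words, which is the limitation the paragraph above points out.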

Many research works on keyphrase extraction (Turney, 2000; Sarkar, Nasipuri, & Ghose, 2012) have identified text summarization, automatic indexing, etc. as application areas of keyphrase extraction. The work presented in Wu and Li (2008) considers keyphrases as document key concepts and incorporates these key concepts in search results. Unlike the above-mentioned works, which concentrate only on the keyphrase extraction task, our proposed work investigates the use of keyphrases (key concepts) in improving summarization performance.

The previous work presented in D’Avanzo and Magnini (2005) used multi-word keyphrases for text summarization tasks. Our proposed summarization approach differs from that of D’Avanzo and Magnini (2005) because their approach gives less emphasis to the redundancy issue, although redundancy is a crucial issue for text summarization tasks.
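
To make the role of keyphrases and redundancy concrete, the following is a minimal sketch of keyphrase-driven sentence selection. It is not the algorithm of this paper, whose details are not given in this preview. It assumes the keyphrases have already been extracted, matches them by plain substring search, and greedily picks the sentence that covers the most keyphrases not yet covered, subject to a word budget. The function name `summarize` and its parameters are hypothetical.

```python
def summarize(sentences, keyphrases, max_words):
    """Greedy keyphrase-coverage selection (illustrative assumption only)."""
    lowered = [s.lower() for s in sentences]
    phrases = [k.lower() for k in keyphrases]
    # Keyphrases found in each sentence, by simple substring matching.
    covers = [{k for k in phrases if k in s} for s in lowered]
    lengths = [len(s.split()) for s in sentences]

    covered, chosen, used = set(), [], 0
    while True:
        best, best_gain = None, 0
        for i, found in enumerate(covers):
            if i in chosen or used + lengths[i] > max_words:
                continue
            gain = len(found - covered)  # only newly covered keyphrases count
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:  # nothing new fits within the budget
            break
        chosen.append(best)
        covered |= covers[best]
        used += lengths[best]
    # Present the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]

docs = [
    "Text summarization reduces information overload.",
    "Keyphrase extraction finds important concepts.",
    "Text summarization is useful.",
    "The weather is nice.",
]
phrases = ["text summarization", "information overload", "keyphrase extraction"]
summary = summarize(docs, phrases, max_words=12)
```

Counting only newly covered keyphrases is one simple way to discourage redundant sentences: the third sentence above adds no new keyphrase once the first has been chosen, so it is skipped.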
