Exploiting Semantic Term Relations in Text Summarization

Kamal Sarkar, Santanu Dam
Copyright: © 2022 |Pages: 18
DOI: 10.4018/IJIRR.289607

Abstract

The traditional frequency-based approach to creating multi-document extractive summaries ranks sentences by scores computed by summing the TF*IDF weights of the words they contain. In this approach, TF (term frequency) reflects how frequently a term (word) occurs in the input, and TF calculated in this way does not take the semantic relations among terms into account. In this paper, we propose methods that exploit semantic term relations to improve the sentence-ranking and redundancy-removal steps of a summarization system. Our proposed summarization system has been tested on the DUC 2003 and DUC 2004 benchmark multi-document summarization datasets. The experimental results reveal that the performance of our multi-document text summarizer improves significantly when a distributional term similarity measure is used for finding semantic term relations. Our multi-document text summarizer also outperforms several well-known summarization baselines to which it is compared.

Introduction

Information overload is a critical problem on the Internet, and text summarization is one of the most effective mechanisms for managing it. Text summarization reduces the input document(s) to a summary, a condensed version of the input. Summarization helps users in several ways: (1) it enables readers to quickly understand what the document(s) is about; (2) summaries can be presented alongside search results to help users judge whether the linked documents are relevant; (3) bandwidth can be saved by sending a summary, rather than the whole document, to small-screen devices first. Summaries are also useful in many other applications, such as text clustering and classification. Although researchers have been working on the text summarization problem for many years, there is still scope for finding better solutions, because summarization is a human ability that is very difficult to model.

According to previous research (Goldstein et al., 2000; Gupta & Siddiqui, 2012; Sarkar, 2014; Sarkar, 2009a), summaries can be of two types: extracts and abstracts. An extract is a summary created by selecting text segments (sentences) from the input, whereas an abstract is created by reformulating text segments selected from the input; an abstract may therefore contain words that are not present in the input. Most existing abstractive summarization methods deal with generating very short or ultra-short summaries (Sarkar & Bandyopadhyay, 2005; Zajic, Dorr & Schwartz, 2002; Rush, Chopra & Weston, 2015; Nallapati et al., 2016; Nallapati, Zhai, & Zhou, 2017). In this paper, we focus on generating extractive multi-document summaries, which are relatively longer than very short summaries.

Most previous work on extraction-based summarization is sentence-ranking based. This approach ranks sentences by scores, where the score of a sentence is computed by combining various feature-based scores such as term frequency, sentence position, and/or cue phrases (Luhn, 1959; Sarkar, 2009b; Sarkar, Nasipuri & Ghosh, 2011). After ranking, the top n sentences are chosen according to the compression ratio.
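The traditional frequency-based scoring just described can be sketched as follows. This is a minimal illustration, not the implementation used in any of the cited systems; the naive sentence splitting and the exact TF/IDF formulas are simplifying assumptions.

```python
import math
import re
from collections import Counter

def tfidf_sentence_ranking(documents, n_top=3):
    """Rank sentences by the sum of the TF*IDF weights of their words."""
    # Naive sentence splitting on sentence-ending punctuation.
    sentences = []
    for doc in documents:
        sentences.extend(
            s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()
        )
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

    # TF: how frequently each term occurs in the whole input.
    tf = Counter(w for toks in tokenized for w in toks)

    # IDF over the input documents (smoothed so that terms appearing in
    # every document still keep a nonzero weight).
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(re.findall(r"[a-z]+", doc.lower())))
    idf = {w: math.log(1.0 + n_docs / df[w]) for w in df}

    # Sentence score = sum of TF*IDF weights of the sentence's words.
    scored = [
        (sum(tf[w] * idf.get(w, 0.0) for w in toks), s)
        for s, toks in zip(sentences, tokenized)
    ]
    scored.sort(key=lambda p: p[0], reverse=True)
    return [s for _, s in scored[:n_top]]
```

Note that TF here is computed over the whole input, matching terms purely by surface form; this is exactly the syntactic matching the paper argues against.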

The centroid-based summarization approach (Radev et al., 2004) is also an extractive approach to multi-document summarization: it ranks sentences by their similarity to a centroid, which is built by choosing a set of the most important words from the input cluster of documents. Word importance is measured by TF*IDF weight, and, as above, TF is calculated from how frequently a term (word) occurs in the input.
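A minimal sketch of centroid construction and sentence scoring in the spirit of Radev et al. (2004) is shown below. This is an illustration of the general idea, not their exact implementation; the `top_k` pruning parameter and the assumption that TF*IDF vectors are already available are simplifications.

```python
from collections import Counter

def build_centroid(doc_term_weights, top_k=10):
    """Average the documents' TF*IDF vectors and keep the top_k words.

    doc_term_weights: one {term: TF*IDF weight} dict per document.
    """
    totals = Counter()
    for weights in doc_term_weights:
        totals.update(weights)
    n = float(len(doc_term_weights))
    averaged = {term: weight / n for term, weight in totals.items()}
    # Keep only the most important words as the cluster centroid.
    top = sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return dict(top)

def centroid_sentence_score(sentence_tokens, centroid):
    """Score a sentence by summing the centroid weights of its words."""
    return sum(centroid.get(w, 0.0) for w in sentence_tokens)
```

Sentences are then ranked by `centroid_sentence_score`, so a sentence scores highly only if its words syntactically match the centroid's words.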

Not only the centroid-based approach but many other summarization approaches as well compute term frequency using purely syntactic term matching, without taking the semantic relations among terms into account; for example, occurrences of "car" and "automobile" are counted as unrelated terms even though they express the same concept. As a result, the traditional TF*IDF-based sentence-ranking approach places some summary-worthy sentences far below the top-ranked sentences, and those sentences are then excluded from the summary by the predefined summary-length restriction.
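The paper's specific distributional term similarity measure is not detailed in this preview. The general idea of distributional similarity, that terms occurring in similar contexts are semantically related, can be sketched as follows; the windowed co-occurrence counting and cosine comparison here are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(tokenized_sentences, window=2):
    """For each term, count the terms occurring within +/- window positions."""
    vectors = defaultdict(Counter)
    for toks in tokenized_sentences:
        for i, w in enumerate(toks):
            lo, hi = max(0, i - window), min(len(toks), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[w][toks[j]] += 1
    return vectors

def distributional_similarity(w1, w2, vectors):
    """Cosine similarity of the two terms' co-occurrence vectors."""
    u, v = vectors.get(w1, Counter()), vectors.get(w2, Counter())
    dot = sum(c * v.get(t, 0) for t, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Under such a measure, "car" and "automobile" receive a high similarity whenever they appear in similar contexts, so semantically related occurrences can reinforce one another instead of being counted as unrelated terms.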

Since the input to a multi-document summarizer is a set of related documents, a multi-document summary may contain redundancy. Redundancy is a crucial issue in multi-document summarization because redundant information makes a summary less informative. Maximal marginal relevance (MMR) is a popular technique for removing redundancy while selecting the top n sentences for an extract (Carbonell & Goldstein, 1998). MMR uses cosine similarity between sentences to identify similar ones, with each sentence represented by a TF*IDF-based bag-of-words model. Hence, the term-mismatch problem leads to data sparseness, which degrades redundancy-removal performance.
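Greedy MMR selection can be sketched as below. This is a minimal version under stated assumptions: sentences are represented as sparse TF*IDF dictionaries, and the lambda trade-off value is a tunable parameter, not a value from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors ({term: weight} dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_select(candidates, query, lam=0.3, k=3):
    """Greedy MMR (Carbonell & Goldstein, 1998): balance relevance to the
    query against similarity to already-selected sentences.

    candidates: list of (sentence, sparse TF*IDF vector) pairs.
    query: sparse vector, e.g. the document-cluster centroid.
    """
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(vec, query)
            redundancy = max((cosine(vec, sv) for _, sv in selected),
                             default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [s for s, _ in selected]
```

Because `cosine` matches terms only by surface form, two sentences expressing the same fact with different words can score as dissimilar, which is the sparseness problem the paper's semantic term relations are meant to address.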

Beyond the cases mentioned above, a number of existing extractive summarization methods (discussed in the next section) use the TF*IDF-based bag-of-words model for text representation, with term weights calculated by the traditional TF*IDF method.
