Hybrid Approach for Single Text Document Summarization Using Statistical and Sentiment Features

Chandra Shekhar Yadav (Jawaharlal Nehru University, Delhi, India) and Aditi Sharan (Jawaharlal Nehru University, Delhi, India)
Copyright: © 2015 |Pages: 25
DOI: 10.4018/IJIRR.2015100104

Abstract

Summarization represents the same information in a concise form while preserving its meaning. Approaches fall into two types: abstractive and extractive. Our work focuses on extractive summarization. A generic approach to extractive summarization treats the sentence as the unit of extraction, scores each sentence on a set of indicative features to judge whether it merits inclusion in the summary, sorts the sentences by score, and takes the top n sentences as the summary. Most existing work scores sentences with statistical features. We propose a hybrid model for single text document summarization. The model is extraction based and combines statistical and semantic techniques. It relies on a linear combination of statistical measures (sentence position, TF-IDF, aggregate similarity, and centroid) and a semantic measure based on sentiment analysis. The idea of using sentiment analysis for salient sentence extraction comes from the observation that emotion plays an important role in conveying a message effectively; hence, it can play a vital role in text document summarization. For comparison, five summaries are considered: the proposed work, the MEAD system, the Microsoft system, the OPINOSIS system, and a human-generated summary. Evaluation is done using ROUGE scores.

1. Introduction

Text document summarization plays an important role in IR (Information Retrieval) because it condenses a large pool of information into a concise form by selecting the salient sentences and discarding redundant sentences (or information). We term this the summarization process.

Radev et al. (2002) defined a summary as “a text that is produced from one or more texts that convey important information in the original texts, and that is no longer than half of the original text and usually significantly less than that”. As explained by Alguliev et al. (2011), automatic text document summarization is an interdisciplinary research area of computer science that draws on AI (Artificial Intelligence), data mining, statistics, and psychology. By technique, text document summarization can be classified in two ways: abstractive summarization and extractive summarization. Abstractive summarization produces a more human-like summary, which is the actual goal of text document summarization. As defined by Mani, I., & Maybury, M. T. (1999) and Wan (2008), abstractive summarization needs three things: information fusion, sentence compression, and reformation. An abstractive summary may contain new sentences, phrases, and even words that are not present in the source document. Although a lot of research has been done in recent decades in NLP (Natural Language Processing) and NLG (Natural Language Generation), and computing power has increased greatly, we are still not close to achieving abstractive summarization. The actual challenge is generating new sentences and new phrases while ensuring that the produced summary retains the same meaning as the source document. Extractive summarization is based on extracted entities; an entity may be a sentence, a subpart of a sentence, a phrase, or a word. Our work focuses on the extractive technique.

In this paper we propose a hybrid method for single text document summarization, which is a linear combination of statistical features, as used in Ko, Y., & Seo, J. (2008), Yeh, J. Y. et al. (2005), Radev, D. R. et al. (2002), and Radev, D. R. (2001), and a new kind of semantic feature, namely sentiment analysis. The ideas used in this paper are derived from different papers: the statistical features and their collective sum come from Ko, Y., & Seo, J. (2008) and Yeh, J. Y. et al. (2005), and the centroid measure is taken from Radev, D. R. et al. (2001) and Radev, D. R. et al. (2004). The inclusion of sentiment analysis is derived from the observation that emotion plays an important role in conveying a message effectively; hence, it can play a vital role in text document summarization.
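The scoring idea described above (score each sentence as a weighted sum of features, sort, keep the top n) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the toy sentiment lexicon, the naive sentence splitter, the in-document word frequency used as a stand-in for TF-IDF, the use of absolute polarity as the sentiment feature, the max-normalization, and the equal default weights. Aggregate similarity and centroid features are omitted to keep the sketch short.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (assumption); a real system would use a full resource.
TOY_LEXICON = {"good": 1.0, "great": 1.0, "excellent": 1.0,
               "bad": -1.0, "poor": -1.0, "terrible": -1.0}


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def position_score(index, total):
    # Earlier sentences score higher; the first sentence gets 1.0.
    return 1.0 - index / total


def frequency_score(tokens, doc_counts):
    # Stand-in for TF-IDF (assumption): mean in-document count of the words.
    return sum(doc_counts[t] for t in tokens) / len(tokens) if tokens else 0.0


def sentiment_score(tokens):
    # Absolute polarity per token (assumption): strong emotion counts either way.
    if not tokens:
        return 0.0
    return abs(sum(TOY_LEXICON.get(t, 0.0) for t in tokens)) / len(tokens)


def normalize(values):
    # Scale a feature to [0, 1] by its maximum so the weights are comparable.
    top = max(values) if values else 0.0
    return [v / top if top > 0 else 0.0 for v in values]


def summarize(text, n=2, weights=(1.0, 1.0, 1.0)):
    sentences = split_sentences(text)
    if not sentences:
        return []
    token_lists = [tokenize(s) for s in sentences]
    doc_counts = Counter(t for tokens in token_lists for t in tokens)
    total = len(sentences)

    pos = normalize([position_score(i, total) for i in range(total)])
    freq = normalize([frequency_score(t, doc_counts) for t in token_lists])
    sent = normalize([sentiment_score(t) for t in token_lists])

    # Linear combination of the normalized feature scores.
    w_pos, w_freq, w_sent = weights
    scores = [w_pos * p + w_freq * f + w_sent * s
              for p, f, s in zip(pos, freq, sent)]

    # Keep the n best sentences, then restore document order for readability.
    best = sorted(range(total), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]
```

Calling `summarize(document_text, n=3)` returns up to three sentences in their original order. Changing `weights` shifts the balance between the statistical and sentiment features; how the weights should actually be chosen is not specified in this preview.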

The paper is organized as follows. Section 2 presents a categorized review of literature from recent years. Section 3 describes the features used for summarization. Section 4 contains the summarization algorithm and the detailed approach. Section 5 describes the corpus, with statistical and linguistic statistics. Section 6 presents experiments and results. Section 7 concludes the paper.

According to Aliguliyev, R. M. (2007), summarization is a three-step process: (1) analysis of the text, (2) transformation into a summary representation, and (3) synthesis, which produces an appropriate summary. Hovy, E., & Lin, C. Y. (1998) introduced the SUMMARIST system with the aim of building a robust text summarization system. It works in three phases, which can be described in the form of an equation: “Summarization = Topic Identification + Interpretation + Generation”.

A lot of research has been done on extraction-based approaches. In extractive summarization, the important task is to find informative sentences, subparts of sentences, or phrases and include these extracted elements in the summary. Here we present the work in two categories: (1) early work and (2) work done in recent years. In our view, three early works set the direction of extractive text document summarization, as explained below.
