A Summarizer for Tamil Language Using Centroid Approach

Syed Sabir Mohamed (Research Scholar, Faculty in Computer Science & Engineering, Sathyabama University, Chennai, India) and Shanmugasundaram Hariharan (Department of Computer Science and Engineering, TRP Engineering College, Tiruchirappalli, India)
Copyright: © 2016 | Pages: 15
DOI: 10.4018/IJIRR.2016010101


Document summarization plays a vital role in the management and dissemination of information. This paper investigates a method for producing summaries of Tamil newspaper text documents. The primary goal is to create an effective and efficient tool that summarizes a given text document as a meaningful extract of the original, using a centroid-based algorithm. The centroid represents a group of words that are statistically important for a document. Each sentence in a document is treated as a vector in a multi-dimensional space, and the sentences nearest to the centroid are considered the most important. The importance of a sentence is determined by three parameters: the centroid value, the positional value, and the first-sentence overlap. A score is calculated for each sentence, and redundancy between sentences is eliminated using CSIS. Finally, the sentences are ranked and those with the highest scores are selected as the summary.
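
The scoring pipeline described in the abstract can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the preview gives no formulas, so the sketch assumes a MEAD-style formulation in which the centroid value is a sum of TF*IDF word weights, the positional value decays linearly from the first sentence, the first-sentence overlap is an inner product of word counts, and CSIS (taken here to mean cross-sentence informational subsumption) is approximated by a word-overlap ratio. The function names, the equal feature weights, the sentence-level IDF, and the 0.7 overlap threshold are all hypothetical choices. Tokens are split on whitespace, which avoids breaking Tamil words at combining vowel signs.

```python
import math
import string
from collections import Counter


def tokenize(sentence):
    """Lowercase, split on whitespace, strip surrounding ASCII punctuation."""
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in words if w]


def centroid_weights(token_lists):
    """TF*IDF weight per word. Assumption: IDF is computed over the
    sentences of the document itself, smoothed so it stays positive."""
    n = len(token_lists)
    tf = Counter(w for tokens in token_lists for w in tokens)
    sf = Counter(w for tokens in token_lists for w in set(tokens))
    return {w: tf[w] * math.log(1 + n / sf[w]) for w in tf}


def overlap(a, b):
    """Redundancy proxy: 2 * shared words / total words, in [0, 1]."""
    shared = sum((Counter(a) & Counter(b)).values())
    total = len(a) + len(b)
    return 2 * shared / total if total else 0.0


def summarize(sentences, k=2, w_c=1.0, w_p=1.0, w_f=1.0, max_overlap=0.7):
    """Return up to k sentences, in document order, ranked by a weighted
    sum of centroid value, positional value, and first-sentence overlap."""
    token_lists = [tokenize(s) for s in sentences]
    weights = centroid_weights(token_lists)
    n = len(sentences)

    # Centroid value: sum of the centroid weights of the sentence's words.
    c = [sum(weights[w] for w in tokens) for tokens in token_lists]
    c_max = max(c, default=0.0)
    first = Counter(token_lists[0]) if token_lists else Counter()

    scores = []
    for i, tokens in enumerate(token_lists):
        p = (n - i) / n * c_max  # positional value: earlier scores higher
        f = sum(first[w] * cnt for w, cnt in Counter(tokens).items())
        scores.append(w_c * c[i] + w_p * p + w_f * f)

    # Greedy selection: skip sentences too similar to ones already chosen.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if len(chosen) >= k:
            break
        if all(overlap(token_lists[i], token_lists[j]) <= max_overlap
               for j in chosen):
            chosen.append(i)
    return [sentences[i] for i in sorted(chosen)]
```

Calling summarize(sentences, k=3) on a list of already segmented sentences returns at most three of them in their original order. A real Tamil pipeline would also need sentence segmentation, stop-word removal, and possibly stemming before scoring, none of which is shown here.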

2. Literature Survey

One surveyed paper focuses on summarizing Persian-language text with a system its authors call AZOM. The approach combines statistical and conceptual properties of the text with the document structure to extract a summary. AZOM is also capable of summarizing unstructured documents. The reported results are better than those of common structured-text summarizers and of other Persian text summarizers (Zamanifar & Kashefi, 2011).

The rapid growth of the Internet has led to the emergence of a large amount of data. India has many languages, so summarization techniques differ across them, and so do the generated summaries. The authors have proposed a system for the detection and removal of deadwood (a word or phrase that can be omitted without any loss) for the Punjabi language. The first task is preprocessing, which involves sentence segmentation and removal of stop words. The second task assigns weights to the text after tokenization, taking into account five different features. Sentences with higher scores are eliminated, and finally the deadwood is removed from the summary.
