Towards High Performance Text Mining: A TextRank-based Method for Automatic Text Summarization

Shanshan Yu (College of Medical Information Engineering, Guangdong Pharmaceutical University, Guangzhou, China), Jindian Su (College of Computer Science and Engineering, South China University of Technology, Guangzhou, China), Pengfei Li (College of Computer Science and Engineering, South China University of Technology, Guangzhou, China) and Hao Wang (Norwegian University of Science and Technology in Aalesund, Aalesund, Norway)
Copyright: © 2016 | Pages: 18
DOI: 10.4018/IJGHPC.2016040104

Abstract

As a typical unsupervised learning method, the TextRank algorithm performs well in large-scale text mining, especially in automatic summarization and keyword extraction. However, when used for automatic summarization, TextRank considers only the similarities between sentences and neglects information about text structure and context. To overcome these shortcomings, the authors propose an improved, highly scalable method called iTextRank. When building the TextRank graph in the new method, the authors compute sentence similarities and adjust the weights of nodes by considering statistical and linguistic features, such as similarities in titles, paragraph structures, special sentences, sentence positions and lengths. Their analysis shows that the time complexity of iTextRank is comparable to that of TextRank. More importantly, two experiments show that iTextRank has a higher accuracy and lower recall rate than TextRank, and that it is as effective as several popular online automatic summarization systems.
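
For readers unfamiliar with the baseline, the following is a minimal sketch of plain TextRank sentence ranking, not of iTextRank. The abstract does not give the formulas for the feature-based weight adjustments, so they are not reproduced here. The sketch assumes the word-overlap similarity commonly associated with the original TextRank formulation and a damping factor of 0.85; the function names (`tokenize`, `overlap_similarity`, `textrank`, `summarize`) are hypothetical and chosen only for illustration.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def overlap_similarity(a, b):
    """Shared-word count normalized by the log lengths of both sentences.

    Assumption: this is the overlap measure of baseline TextRank,
    not the similarity used by iTextRank.
    """
    if len(a) < 2 or len(b) < 2:
        return 0.0  # skip very short sentences; keeps the denominator positive
    shared = len(set(a) & set(b))
    return shared / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over a weighted similarity graph."""
    tokens = [tokenize(s) for s in sentences]
    n = len(tokens)
    # Symmetric edge weights; no self-loops.
    weights = [[0.0 if i == j else overlap_similarity(tokens[i], tokens[j])
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each node receives a share of its neighbours' scores,
        # proportional to the edge weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    return scores


def summarize(sentences, k=2):
    """Return the k top-scoring sentences in their original order."""
    scores = textrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

In this sketch every sentence is a node and every pair of sentences with shared words is a weighted edge. A method that adjusts node weights, as the abstract describes for iTextRank, would modify the scores or edge weights using additional features before or during this iteration.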
Article Preview

1. Introduction

It is commonly agreed that we are in the era of big data (Wang et al. 2015). Among the various types of data, text is the most common and pervasive across the network. Although many effective technologies for distributed or parallel computing have been proposed, e.g., MapReduce (Slagter et al. 2013; Slagter et al. 2015; Slagter et al. 2015), the information overload problem is getting worse as the quantity of data keeps increasing rapidly. Automatic text summarization has emerged as an effective technology for producing a concise and fluent summary that conveys the key information in the original text document (Nenkova & McKeown, 2012). High-performance automatic summarization has become a very important topic in machine learning and data mining, and it is widely used in a large number of industrial sectors, especially in search engines such as Google, Baidu and Yahoo, and in news portals such as BBC, CNN and NBC News. Researchers have developed various word-based, sentence-based and graph-based summarization methods. Among them, graph-based methods have attracted a lot of attention. For example, Ferreira et al. (2013) proposed a four-dimensional graph model (covering similarity, semantic similarity, co-reference and discourse information) that takes into account co-reference resolution and the role of pronouns in connecting sentences. See (Gupta & Lehal, 2010) and (Joshi & Sonawane, 2015) for more detailed surveys of extractive summarization techniques and graph-based methods.
