News Document Summarization Driven by User-Generated Content

Luca Cagliero (Politecnico di Torino, Italy) and Alessandro Fiori (IRC@C: Institute for Cancer Research and Treatment at Candiolo, Italy)
Copyright: © 2013 | Pages: 22
DOI: 10.4018/978-1-4666-2806-9.ch007

Abstract

Experiments performed on real collections of news articles and driven by on-topic Twitter posts show the effectiveness of the proposed approach.

Introduction

The increasing availability of Web documents (e.g., news articles, scientific papers, books, and magazines) and the popularity of social network communities, such as Twitter and Facebook, have significantly changed the lifestyle of Web users. Nowadays, social network sites help users find people with similar interests and goals, provide a means of exchanging news, and facilitate multimedia content sharing. Furthermore, the huge number of news document collections available on the Web represents a powerful source of knowledge for both industrial and academic purposes.

An interesting research direction focuses on condensing the huge mass of electronic document content into concise representations, i.e., summaries. Multi-document summarization addresses the selection of the most relevant and non-redundant sentences from a collection of textual documents. Previous approaches commonly rely on either information retrieval (Carenini, Ng, & Zhou, 2007; Radev, 2004) or data mining techniques (Thakkar, Dhareskar, & Chandak, 2010; Wang & Li, 2010; Wang, Zhu, Li, Chi, & Gong, 2011). Most of them are based on (1) clustering algorithms (e.g., Thakkar, et al., 2010; Wang & Li, 2010; Wang, et al., 2011), (2) graph-based methods (e.g., Radev, 2004), or (3) linear programming algorithms (e.g., Takamura & Okumura, 2009b). While clustering is exploited to group the sentences of a document collection, graph-based methods represent correlations among sentences by means of a graph-based model. To select the most representative sentences according to the generated model, well-established graph-based indexing strategies are usually exploited (Radev, 2004). In contrast, linear programming approaches formalize the summarization problem as a min-max optimization problem (Takamura & Okumura, 2009b). However, all the aforementioned approaches are general-purpose summarizers applicable to document collections coming from any source and, thus, do not take actual social interest into account when performing the summarization task.
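To make the graph-based family more concrete, the following is a minimal Python sketch of sentence ranking over a similarity graph. It is an illustration under stated assumptions, not the method of this chapter nor a faithful reproduction of any cited system: the bag-of-words cosine similarity, the PageRank-style scoring, the damping value, and the function names (`tokenize`, `cosine`, `rank_sentences`, `summarize`) are all choices made here for the example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by PageRank-style iteration on a similarity graph.

    Assumption: edge weights are plain cosine similarities; real systems
    typically use TF-IDF weighting and similarity thresholds.
    """
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(tokenize(s)) for s in sentences]
    # Weighted adjacency matrix without self-loops.
    weights = [
        [cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score to sentence i.
            incoming = sum(
                scores[j] * weights[j][i] / out_sums[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The storm hit the coast on Monday.",
    "The storm damaged homes along the coast.",
    "Residents on the coast fled the storm.",
    "Stock markets closed higher.",
]
summary = summarize(sentences, k=2)
```

In this toy collection the last sentence shares no terms with the others, so it receives only the baseline score and is left out of the summary. Note that the sketch ranks by centrality alone and does nothing to suppress redundant sentences, which a full summarizer would also need to handle.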

The outstanding growth of online communities and social networks has made a huge amount of User-Generated Content (UGC) available to analysts. Some preliminary attempts have been made to integrate the information provided by user-generated content into the document summarization process. Previous work includes (1) the exploitation of Wikipedia content to identify key concepts in document collections (Gong, Qu, & Tian, 2010; Miao & Li, 2010), (2) the use of social annotations (tags) in graph-based text summarization (Zhu, et al., 2009), and (3) the learning of classification models, driven by the main social network data features, to evaluate document sentences (Yang, et al., 2011). However, these approaches still show limited effectiveness due to (1) the difficulty of capturing the most significant correlations among multiple terms at the same time and (2) the low quality of the training data used for sentence selection in supervised methods.
