Dynamic Document Clustering Using Singular Value Decomposition

Rashmi Nadubeediramesh (Department of Information Systems, University of Maryland Baltimore County (UMBC), Baltimore, MD, USA) and Aryya Gangopadhyay (Department of Information Systems, University of Maryland Baltimore County (UMBC), Baltimore, MD, USA)
DOI: 10.4018/jcmam.2012070103

Abstract

Incremental document clustering is important in many applications, but particularly so in healthcare contexts, where text data is abundant and ranges from published journal research to day-to-day healthcare data such as discharge summaries and nursing notes. In such dynamic environments, new documents are constantly added to the set of documents used in the initial cluster formation. Hence it is important to be able to update the clusters incrementally, at low computational cost, as new documents are added. In this paper the authors describe a novel, low-cost approach for incremental document clustering. Their method is based on conducting singular value decomposition (SVD) incrementally. They dynamically fold new documents into the existing term-document space and assign these new documents to pre-defined clusters based on intra-cluster similarity. This saves the cost of re-computing the SVD on the entire document set every time updates occur. The authors also provide a way to retrieve documents based on different window sizes with high scalability and good clustering accuracy. They have tested their proposed method experimentally with 960 medical abstracts retrieved from the PubMed medical library. The authors' incremental method is compared with the default situation in which the SVD is completely re-computed when new documents are added to the initial set of documents. The results show minor decreases in the quality of the cluster formation but much larger gains in computational throughput.
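
The folding-in step described in the abstract can be illustrated with a minimal Python sketch. It assumes the standard latent semantic indexing projection, in which a new term-frequency vector d is mapped into the existing k-dimensional space as d^T U_k S_k^{-1}, and it assumes that a new document is assigned to the cluster whose centroid is most similar by cosine similarity. The abstract does not specify the exact similarity measure, so the function names (truncated_svd, fold_in, assign_cluster) and the centroid rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def truncated_svd(A, k):
    """Rank-k SVD of a term-document matrix A (terms x documents).

    Returns U_k (terms x k), s_k (k,), and V_k (documents x k), so that
    each row of V_k is the latent representation of one document.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in(d, U_k, s_k):
    """Project a new term vector d into the existing latent space.

    Computes d^T U_k S_k^{-1} without recomputing the SVD.
    """
    return (d @ U_k) / s_k


def assign_cluster(d_hat, V_k, labels):
    """Assign a folded-in document to the most similar cluster.

    Assumption: similarity is the cosine between d_hat and the centroid
    of each existing cluster in the latent space.
    """
    labels = np.asarray(labels)
    best_label, best_sim = None, -np.inf
    for label in np.unique(labels):
        centroid = V_k[labels == label].mean(axis=0)
        denom = np.linalg.norm(d_hat) * np.linalg.norm(centroid)
        sim = float(d_hat @ centroid / denom) if denom > 0 else 0.0
        if sim > best_sim:
            best_label, best_sim = int(label), sim
    return best_label, best_sim
```

In this sketch the initial SVD is computed once; each later document costs only one matrix-vector product and one comparison per cluster. Because folded-in documents do not update U_k or S_k, the latent space gradually drifts from what a full recomputation would give, which is consistent with the small loss in cluster quality the abstract reports.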

Introduction

Educational institutions, industries, organizations, and government agencies allocate substantial resources to research and development. Researchers around the world are working to devise methods and approaches that solve current problems and anticipate and prevent future ones. However, storing, updating, retrieving, and grouping the literature on various research topics remains a major concern. Much work has been done on dimensionality reduction techniques and the information retrieval process. Digital libraries must be capable of storing and delivering trillions of bytes of data to millions of users. Our goal is to address these key issues and provide an effective solution that reduces storage space, retrieval time, memory usage, and computation time, and hence increases overall performance. To achieve this goal we:

  • Develop an incremental method for document clustering using singular value decomposition;

  • Demonstrate the accuracy of the cluster formation and maintenance in an incremental manner;

  • Empirically demonstrate the scalability of our proposed method using medical abstracts from PubMed.
