An Approach to Clustering of Text Documents Using Graph Mining Techniques

Bapuji Rao (BPUT, Rourkela, India) and Brojo Kishore Mishra (Department of Information Technology, C. V. Raman College of Engineering, Bhubaneswar, India)
Copyright: © 2017 |Pages: 18
DOI: 10.4018/IJRSDA.2017010103


This paper introduces a new approach to clustering text documents based on a set of words, using graph mining techniques. Given a set of text documents and a set of words, the proposed approach clusters (groups) those documents in which the search for the given words succeeds. The document-word relation can be represented as a bipartite graph, and each cluster of text documents is represented as a sub-graph. The paper further proposes an algorithm for clustering text documents for a given set of words. The system is automated and requires minimal human interaction. The algorithm has been implemented in the C++ programming language, and satisfactory results were observed.
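The idea in the abstract can be illustrated with a small sketch. It is not the authors' algorithm, which the preview does not show, and it is written in Python rather than the C++ the paper reports. It assumes that documents are whitespace-tokenized, that an edge joins a document to every word it contains, and that the cluster for a query word is the set of documents adjacent to that word. The names `build_bipartite`, `cluster_by_words`, and the sample file names are hypothetical.

```python
def build_bipartite(documents):
    """Map each document name to the set of distinct words it contains.

    The mapping is an adjacency list for the document side of a
    bipartite graph whose other side is the vocabulary.
    """
    return {name: set(text.lower().split()) for name, text in documents.items()}


def cluster_by_words(doc_words, query_words):
    """For each query word, collect the documents that contain it.

    Each (word, documents) pair is the sub-graph formed by one word
    node and its neighbouring document nodes.
    """
    clusters = {}
    for word in query_words:
        key = word.lower()
        clusters[key] = sorted(
            name for name, words in doc_words.items() if key in words
        )
    return clusters


# Hypothetical sample data, made up for illustration only.
documents = {
    "d1.txt": "graph mining finds patterns in graph data",
    "d2.txt": "text clustering groups similar documents",
    "d3.txt": "graph based clustering of text documents",
    "d4.txt": "an unrelated note about cooking",
}
doc_words = build_bipartite(documents)
clusters = cluster_by_words(doc_words, ["graph", "clustering"])
for word, names in clusters.items():
    print(word, names)
```

Under these assumptions a document may belong to more than one cluster, and a document that contains none of the query words belongs to none.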

Literature Review

The Scatter-Gather method (Cutting, Karger, Pedersen, & Tukey, 1992) organizes documents hierarchically into coherent categories. This clustered organization of the document collection provides a systematic technique for browsing it.

According to (Aggarwal & Zhai, 2012), both feature selection and feature transformation methods, such as Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (PLSA), and Non-negative Matrix Factorization (NMF), are used to improve the quality of the document representation and make it more amenable to text clustering. Feature selection is more common and easier to apply in text clustering when supervision is available for the feature selection process, as proposed by (Yang & Pedersen, 1997). Because the results of text clustering depend heavily on document similarity, the concept of term contribution (Liu, Liu, Chen, & Ma, 2003) is applied in such cases: the contribution of a term is viewed as its contribution to document similarity.
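A term's contribution to document similarity can be made concrete with a short sketch. It assumes that similarity is a dot product of term weights, so that a term's contribution is the sum, over all ordered pairs of distinct documents, of the product of its weights in the two documents. Raw term counts are used as weights for brevity, which is an assumption of this example; a normalized weighting such as tf-idf would be the more usual choice. The function name `term_contribution` is hypothetical.

```python
from collections import Counter


def term_contribution(documents):
    """Return {term: sum of w(t, d_i) * w(t, d_j) over ordered pairs i != j}.

    Weights are raw term counts (an assumption made for brevity).
    """
    counts = [Counter(text.lower().split()) for text in documents]
    vocabulary = set().union(*counts)
    scores = {}
    for term in vocabulary:
        weights = [c[term] for c in counts]
        total = sum(weights)
        # (sum of w)^2 minus the sum of w^2 equals the sum of
        # w_i * w_j over all ordered pairs with i != j.
        scores[term] = total * total - sum(w * w for w in weights)
    return scores


# Hypothetical sample data, made up for illustration only.
docs = [
    "graph mining graph data",
    "graph clustering text",
    "text clustering documents",
]
scores = term_contribution(docs)
for term in sorted(scores, key=lambda t: (-scores[t], t)):
    print(term, scores[term])
```

A term that occurs in only one document scores zero, since it cannot add to the similarity of any pair; such terms are natural candidates for removal in unsupervised feature selection.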

The technique of concept decomposition, studied in (Aggarwal & Yu, 2001) and (Dhillon & Modha, 2001), applies any standard clustering technique to the original representation of the documents. The frequent terms in the centroids of the resulting clusters are used as basis vectors, which are almost orthogonal to one another. The documents can then be represented much more concisely in terms of these basis vectors. This condensed conceptual representation allows for enhanced clustering as well as classification of text documents. A second phase of clustering can therefore be applied to the condensed representation in order to cluster the documents much more effectively (Salton, 1983). Such a method is tested in (Slonim & Tishby, 2000), which uses word-clusters to represent documents.
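The two steps described above, building basis vectors from cluster centroids and re-expressing documents in terms of them, can be sketched as follows. This is a simplified illustration and not the method of any cited paper. It assumes the cluster labels already come from some standard clustering step (they are hard-coded here), keeps only the `top_k` most frequent centroid terms, and represents a document by plain dot products, which is a reasonable stand-in only when the basis vectors are nearly orthogonal. The names `centroid_basis` and `project` are hypothetical.

```python
import math
from collections import Counter


def centroid_basis(docs, labels, top_k=2):
    """Build one unit-length basis vector per cluster.

    Each vector keeps the top_k most frequent terms of the cluster's
    summed term counts. The sum is used instead of the mean because
    the vector is normalized afterwards.
    """
    sums = {}
    for text, label in zip(docs, labels):
        sums.setdefault(label, Counter()).update(text.lower().split())
    basis = {}
    for label, centroid in sums.items():
        # Most frequent terms first; ties are broken alphabetically.
        top = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        norm = math.sqrt(sum(v * v for _, v in top))
        basis[label] = {term: v / norm for term, v in top}
    return basis


def project(text, basis):
    """Represent a document by its dot product with each basis vector."""
    counts = Counter(text.lower().split())
    return {
        label: sum(counts[term] * weight for term, weight in vector.items())
        for label, vector in basis.items()
    }


# Hypothetical sample data; labels stand in for a first clustering pass.
docs = [
    "graph mining graph",
    "graph mining data",
    "text clustering text",
    "text clustering documents",
]
labels = [0, 0, 1, 1]
basis = centroid_basis(docs, labels)
reduced = project("graph mining of text", basis)
print(reduced)
```

Each document is reduced from one coordinate per vocabulary term to one coordinate per cluster, and a second clustering pass could then run on these short vectors.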
