An Improved Genetic Algorithm for Document Clustering on the Cloud

Ruksana Akter (Hankuk University of Foreign Studies, Seoul, Korea) and Yoojin Chung (Hankuk University of Foreign Studies, Seoul, Korea)
Copyright: © 2018 | Pages: 9
DOI: 10.4018/IJCAC.2018100102

Abstract

This article presents a modified genetic algorithm for text document clustering on the cloud. Traditional genetic-algorithm approaches to document clustering represent chromosomes based on cluster centroids and do not divide cluster centroids during crossover operations. This limits the variation the algorithm can introduce into the population, so it can become trapped in local minima. In the proposed approach, a crossover point may be selected even at a position inside a cluster centroid, which allows some cluster centroids to be modified. This helps the algorithm escape local minima and find better solutions than the traditional approaches. Moreover, instead of running only one genetic algorithm as in the traditional approaches, this article partitions the population and runs a genetic algorithm on each partition. This makes it possible to run different parts of the algorithm simultaneously on different virtual machines in cloud environments. Experimental results demonstrate that the accuracy of the proposed approach is at least 4% higher than that of the other approaches.

1. Introduction

The objective of document clustering is to partition a set of available documents into groups, commonly referred to as clusters, so that similar documents are placed in the same cluster while dissimilar ones are placed in different clusters (Agrawal & Agrawal, 2015). Document clustering is used in many applications that deal with information retrieval (Song & Park, 2009).

In this era of the Internet, the cloud, and Big Data, a large amount of data is being added to the web every moment (Muppidi & Murty, 2015). Such growth has made it necessary to analyze large datasets quickly (Esteves, Hacker, & Rong, 2014). Since most of this data is unstructured, we depend on search engines to find the desired materials. A search engine tries to find similar materials when it receives a query from a user. Hence, when similar materials (e.g., data files) are clustered together beforehand, the search engine can find them efficiently. For this reason, document clustering has attracted many researchers in recent years (Song et al., 2009).

Researchers have already proposed many methods for clustering text documents (Koontz, Narendra, & Fucunaga, 1976; Frigui & Krishnapuram, 1999). K-means clustering (Jain, Murty, & Flynn, 1999; Jain, 2008; Jain & Dubes, 1988; Duda, Hart, & Stork, 2000; Cha & Kwon, 2001) is a very popular algorithm that partitions the available documents into K different groups. Despite the simplicity and efficiency of K-means, the need to specify the value of K (i.e., the number of clusters to be made) in advance often poses a difficult problem. Moreover, the quality of the resulting clustering depends heavily on the initial clustering, which is usually set at random at the beginning of the algorithm (Cha et al., 2001).
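The sensitivity of K-means to its starting point can be illustrated with a minimal sketch. It is a toy example on four two-dimensional points, not the setup used in this article: the function names (`kmeans`, `within_cluster_sse`), the squared Euclidean distance, and the explicitly supplied initial centroids are all assumptions made for illustration. Real document clustering would operate on high-dimensional document vectors, and the initial centroids would normally be drawn at random.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iters=100):
    """Plain K-means; K is fixed in advance by len(initial_centroids)."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    assignment = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


def within_cluster_sse(points, centroids, assignment):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


# Toy data (hypothetical): two tight groups, far apart along the x-axis.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])  # one seed in each group
bad = kmeans(points, [(0.0, 0.0), (0.0, 1.0)])    # both seeds in the same group

print(within_cluster_sse(points, *good))  # 1.0
print(within_cluster_sse(points, *bad))   # 100.0
```

Both runs use the same data and the same K, yet the second run converges to a much worse partition because of where it started. This is the kind of local optimum that motivates population-based search.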

Population-based evolutionary algorithms, such as ant clustering (Vizine, Castro, Hruschka, & Gudwin, 2005) and genetic algorithms (Song et al., 2009), are also used in the document clustering problem domain (Agrawal et al., 2012). In particular, a number of studies use genetic algorithms for document clustering (Song et al., 2009). A genetic algorithm is an optimization method that follows the principles of evolution through randomized natural selection (Maulik & Bandyopadhyay, 2000). Genetic algorithms are well accepted as a way to find good solutions, especially in large multi-modal spaces (Srinivas & Patnaik, 1994). However, a genetic algorithm may suffer from the premature convergence phenomenon (PCP) (Andre, Siarry, & Dongon, 2001), in which it converges to a local extremum. The Double Layered Genetic algorithm for document Clustering (DLGC) (Choi, Lee, & Park, 2013) tries to avoid PCP. However, its computational cost is very high, and DLGC also requires the number of desired clusters to be specified.
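The difference between a crossover that keeps centroids intact and one that may cut inside a centroid, as described in the abstract, can be sketched as follows. This excerpt does not give the article's actual encoding, so the sketch assumes a chromosome is a flat list formed by concatenating K centroids of a fixed dimension. All function names here are hypothetical.

```python
import random


def single_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes at index `point`."""
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2


def boundary_point(rng, num_centroids, dim):
    """Traditional choice (assumed): cut only between two centroids."""
    return dim * rng.randint(1, num_centroids - 1)


def any_point(rng, num_centroids, dim):
    """Modified choice (assumed): cut at any gene, possibly inside a centroid."""
    return rng.randint(1, num_centroids * dim - 1)


def split_into_centroids(chromosome, dim):
    """View a flat chromosome as a list of centroids of length `dim`."""
    return [chromosome[i:i + dim] for i in range(0, len(chromosome), dim)]


# Two parents, each encoding K = 2 centroids of dimension 3 (toy values).
parent_a = [1, 2, 3, 4, 5, 6]
parent_b = [10, 20, 30, 40, 50, 60]

# Cut at a centroid boundary: children only recombine whole parent centroids.
child, _ = single_point_crossover(parent_a, parent_b, 3)
print(split_into_centroids(child, 3))  # [[1, 2, 3], [40, 50, 60]]

# Cut inside the second centroid: a centroid absent from both parents appears.
child, _ = single_point_crossover(parent_a, parent_b, 4)
print(split_into_centroids(child, 3))  # [[1, 2, 3], [4, 50, 60]]

# During a run, the cut position would be drawn at random.
rng = random.Random(0)
print(any_point(rng, num_centroids=2, dim=3))
```

With boundary-only cuts, every centroid in a child already exists in one of its parents, so new centroids can only come from mutation. Allowing a cut inside a centroid lets crossover itself produce new centroids, which is the extra source of variation the abstract refers to.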
