1. Introduction
The objective of document clustering is to partition a set of documents into groups, commonly referred to as clusters, so that similar documents are placed in the same cluster while dissimilar ones are placed in different clusters (Agrawal & Agrawal, 2015). Document clustering is used in many information retrieval applications (Song & Park, 2009).
In this era of the Internet, cloud computing, and Big Data, a large amount of data is being added to the web every moment (Muppidi & Murty, 2015). Such growth has created a need to analyze large datasets quickly (Esteves, Hacker, & Rong, 2014). Since most of these data are unstructured, we depend on search engines to find the desired materials. A search engine tries to find materials similar to a query submitted by a user. Hence, when similar materials (e.g., data files) are clustered together beforehand, the search engine can find them efficiently. For this reason, document clustering has attracted many researchers in recent years (Song et al., 2009).
Researchers have already proposed many methods for clustering text documents (Koontz, Narendra, & Fucunaga, 1976; Frigui & Krishnapuram, 1999). K-means clustering (Jain, Murty, & Flynn, 1999; Jain, 2008; Jain & Dubes, 1988; Duda, Hart, & Stork, 2000; Cha & Kwon, 2001) is a very popular algorithm that partitions the available documents into K different groups. Despite the simplicity and efficiency of K-means, the need to specify the value of K (i.e., the number of clusters to be formed) in advance is often a difficult problem. Moreover, the quality of the resulting clustering depends heavily on the initial clustering, which is usually set at random at the beginning of the algorithm (Cha et al., 2001).
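The two drawbacks of K-means noted above can be made concrete with a minimal sketch. It assumes that documents have already been converted to numeric feature vectors (for example, term counts) and that squared Euclidean distance is the dissimilarity measure. The names kmeans, k, and seed are illustrative and do not come from any of the cited works.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means sketch. Both k and the random seed are supplied by the caller."""
    rng = random.Random(seed)
    # Random initialization: pick k of the input vectors as starting centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean(members)
    return assignment, centroids
```

Calling kmeans on the same vectors with different values of k or seed is a simple way to observe that the caller must choose K up front and that, on less well-separated data, different random initializations can yield different final partitions.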
Population-based evolutionary algorithms, such as ant clustering (Vizine, Castro, Hruschka, & Gudwin, 2005) and the genetic algorithm (Song et al., 2009), are also being applied to the document clustering problem (Agrawal et al., 2012). In particular, a number of studies use the genetic algorithm for document clustering (Song et al., 2009). The genetic algorithm is an optimization method that follows the principles of evolution through randomized natural selection (Maulik & Bandyopadhyay, 2000). It is well accepted as a way to find a good solution, especially in a large multi-modal search space (Srinivas & Patnaik, 1994). However, it may suffer from the premature convergence phenomenon (PCP) (Andre, Siarry, & Dongon, 2001), in which the search converges to a local extremum. The Double Layered Genetic algorithm for document Clustering (DLGC) (Choi, Lee, & Park, 2013) tries to avoid PCP. However, its computational cost is very high, and DLGC also requires the number of desired clusters to be specified.
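To illustrate how a genetic algorithm can be applied to clustering in general, the following is a minimal sketch under several assumptions: each chromosome stores one cluster label per document, fitness is the negative within-cluster sum of squared errors, and the operators are tournament selection, one-point crossover, random-reset mutation, and elitism. These choices, the function names, and the parameter values are illustrative only. This is not the method of Song et al. (2009) and not DLGC, and like those approaches it still takes the number of clusters k as an input.

```python
import random


def sse(points, labels, k):
    """Within-cluster sum of squared distances to each cluster's centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue  # an empty cluster contributes nothing
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in members
        )
    return total


def genetic_clustering(points, k, pop_size=30, generations=60,
                       mutation_rate=0.05, seed=0):
    """Toy genetic algorithm; needs len(points) >= 2 and pop_size >= 3."""
    rng = random.Random(seed)
    n = len(points)

    def fitness(labels):
        return -sse(points, labels, k)  # lower error means higher fitness

    # Initial population: random label assignments.
    population = [[rng.randrange(k) for _ in range(n)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitism: carry the best individual over unchanged.
        next_gen = [max(population, key=fitness)[:]]
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random individuals.
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)
            # One-point crossover.
            cut = rng.randrange(1, n)
            child = parent1[:cut] + parent2[cut:]
            # Mutation: occasionally reset a gene to a random label.
            child = [rng.randrange(k) if rng.random() < mutation_rate else g
                     for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)
```

Because the population is driven toward whichever labelings currently score best, lowering mutation_rate or pop_size in this sketch reduces diversity, which is one intuitive way to see how premature convergence to a local extremum can arise.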