KD-Tree Based Clustering for Gene Expression Data

KD-Tree Based Clustering for Gene Expression Data

Damodar Reddy Edla (National Institute of Technology, Goa, India), Prasanta K. Jana (Indian School of Mines, India) and Seshaiah Machavarapu (Tata Consultancy Services, India)
Copyright: © 2014 |Pages: 15
DOI: 10.4018/978-1-4666-5202-6.ch122

Chapter Preview



K-means is one of the widely researched clustering algorithms. But it is sensitive to the selection of initial cluster centers and estimation of the number of clusters. In this chapter, we propose a novel approach to find the efficient initial cluster centers using kd-tree and compute the number of clusters using joint distance function. We have carried out excessive experiments on various synthetic as well as gene expression data. Dunn validity index is used to examine the quality of the clusters in case of multi-dimensional gene expression data. The experimental results are compared with the existing techniques using the Dunn validity index and number of iterations.


Main Focus

K-means (MacQueen, 1967) is a well known partitional clustering algorithm. The clusters of K-means are represented by iteratively-changing centroids chosen randomly. K-means find the squared distances between these centers and the given objects to assign the objects to their closer centroids. However, K-means has the demerit of the random selection of initial cluster centers. It also has the problem of estimating the number of clusters. Many researches proposed various methods to overcome these problems, a good review of which can be seen from Jain (2010) and Al-Daoud and Roberts (1996).

Motivated with them, we propose an algorithm to enhance the K-means clustering. This algorithm has two phases. In the first phase, we use the approach of kd-tree to find the efficient initial seeds for K-means clustering. Then, in the next phase we use the joint distance function (Butenko, Chaovalitwongse, & Pardalos, 2009) to find the right number of clusters. We have carried out excessive experiments on various synthetic as well as gene expression (GE) data. The results are compared with classical K-means (MacQueen, 1967), improved K-means (Geraci et al., 2007), CCIA (Khan & Ahmad, 2004) and fuzzy C-means (Bezdek, Ehrlich, & Full, 1984) algorithms. Dunn validity index (Halkidi, Batistakis, & Vazirgiannis, 2001) is used to examine the quality of the clusters of multi-dimensional data. Finally, the proposed method is compared with the existing methods using number of iterations.

Key Terms in this Chapter

Gene Expression Data: The data produced by the translation of information encoded in a gene into protein or RNA structures that are present and operating in the cell.

Clustering: A process of partitioning the given data points into homogeneous groups called clusters.

Joint Distance Function: A function to find the optimal number of clusters of the given data set.

K-Means: A well known partitional clustering algorithm which depends on the random initialization of the cluster centers.

Cluster Center: A point to represent central location (usually mean) of the cluster. Cluster centers have been used to represent the points of its cluster.

kd-tree: A space partitioning technique to arrange the given objects in a tree structure of k -dimensional space.

Dunn Validity Index: A measure proposed by Dunn to judge the quality of the clusters based on diameter and dispersion.

Complete Chapter List

Search this Book: