Optimized Clustering Techniques with Special Focus to Biomedical Datasets

Anusuya S. Venkatesan (Saveetha University, India)
DOI: 10.4018/978-1-5225-0660-7.ch015


Clinical data, including clinical test results, MRI images and patients' drug responses, are documented and analyzed with machine learning and data mining tools. The scale and complexity of these datasets are a major challenge for the machine learning and data mining community because the data are of mixed type. Extracting meaningful or desired information from these datasets provides knowledge for the decision-making process, which in turn helps in the diagnosis and treatment of diseases. Biomedical datasets are collections of diverse data types, as they involve images, clinical studies, statistical reports, etc. Recent research has focused on different clustering and classification methods to manage and analyze biomedical datasets. The objective of this chapter is to cluster or classify the patterns of interest in brain MRI images and in liver disorder and breast cancer datasets using efficient clustering methodologies. Among the different data mining algorithms for clustering, classification, visualization and interpretation, K Means, Fuzzy C Means (FCM) and Neural Networks (NN) are frequently used for clustering and classification of biomedical datasets. The performance of these methods is strongly influenced by the initial choice of K and by their convergence speed. This chapter discusses the FCM and K Means clustering methods and their optimization with metaheuristics such as Particle Swarm Optimization (PSO) and Quantum Particle Swarm Optimization (QPSO). The experimental section of this chapter presents an analysis in terms of intra-cluster distances, elapsed time and the Davies-Bouldin Index (DBI).
Chapter Preview


Data mining is an interdisciplinary area that involves artificial intelligence, soft computing, database systems, etc. Data mining tools infer information from databases, and this information is converted into knowledge of patterns and relationships. The relationships among data are referred to as classes, clusters, associations and sequential patterns. Data clustering has been applied widely in data mining and machine learning. Specific applications include statistics (McLachlan et al., 1997), bioinformatics (He et al., 2006), machine learning (Ethem Alpaydin, 2004), exploratory data analysis, image segmentation, security, medical image analysis, web handling and mathematical programming (Pyle, 1999; Panov et al., 2008). Clustering splits the data into groups according to the similarity between data: items within a group are homogeneous, while items in different groups are inhomogeneous.

Clusters are formed by computing the distance between data points. Existing tools explore the data and help to visualize it in different models. The cluster representation is one of the most widely accepted exploratory models for analyzing data at different levels of observation. Clustering is applied to biomedical datasets to understand the characteristics of biological information and to find interesting patterns associated with prior information. Most biomedical datasets have inherent noise and inconsistency, and are sometimes mixed with semantic information and experimental results. Hence, generating quality clusters on biomedical datasets is a challenging task. The role of clustering in biomedical datasets is to derive meaningful information that assists pathologists in decision making.

In medical image segmentation, the system divides the whole image into multiple segments and extracts only the specific region of interest for investigation. The intensities of all pixels within a homogeneous cluster are similar, whereas they differ from the intensities of pixels in other clusters. Medical image analysis depends mainly on effective image segmentation to extract suspicious regions from complex medical images (Neeraj et al., 2010).
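The idea of grouping pixels by intensity can be illustrated with a minimal sketch. It assumes a tiny hand-made grey-level image and a plain one-dimensional K Means with two clusters; the function name `segment_intensities`, the sample image and the starting centres are all hypothetical and are not taken from the chapter.

```python
def segment_intensities(pixels, centres, iterations=10):
    """Plain 1-D K Means on grey levels; returns (labels, centres)."""
    centres = list(centres)
    labels = []
    for _ in range(iterations):
        # Assignment step: each pixel joins the centre with the closest intensity.
        labels = [
            min(range(len(centres)), key=lambda k: abs(p - centres[k]))
            for p in pixels
        ]
        # Update step: each centre moves to the mean intensity of its members.
        for k in range(len(centres)):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:  # leave an empty cluster's centre unchanged
                centres[k] = sum(members) / len(members)
    return labels, centres


# Hypothetical 3x4 image: a dark background with a bright region on the right.
image = [
    [10, 12, 11, 200],
    [9, 13, 198, 205],
    [11, 10, 202, 199],
]
width = len(image[0])
flat = [p for row in image for p in row]
labels, centres = segment_intensities(flat, centres=[0.0, 255.0])

# Reshape the labels into a mask; label 1 marks the bright region.
mask = [labels[r * width:(r + 1) * width] for r in range(len(image))]
for row in mask:
    print(row)
```

Real MRI data would need preprocessing and usually more than two clusters; the sketch only shows how similar intensities end up in the same segment.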

Clustering is treated as an optimization problem whose criteria are to maximize the similarity within a cluster and to maximize the dissimilarity between clusters. Table 1 shows the different distance measures used to find the distance between data points. Some clustering techniques use heuristic algorithms (Bandyopadhyay, 2002; Das et al., 2008; Ouadfel et al., 2010) to obtain the cluster centres. The objectives of combining optimization techniques with clustering techniques are 1) to find global optima, 2) to enforce robustness against initialization, 3) to improve the partitioning quality, 4) to deal with both unknown and known numbers of clusters, and 5) to speed up convergence. Clusters are represented in different forms, such as connectivity models, centroid models, distribution models, density models and graph-based models.
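Viewing clustering as an optimization problem means that any candidate set of centres can be scored by an objective function. The sketch below assumes one common choice, the sum of Euclidean distances from each point to its nearest centre; the function name `intra_cluster_distance` and the sample points are hypothetical. A heuristic such as PSO could use a score of this kind as its fitness value, although the chapter's exact fitness function is not given in this preview.

```python
import math


def intra_cluster_distance(points, centres):
    """Sum of Euclidean distances from each point to its nearest centre."""
    total = 0.0
    for p in points:
        total += min(math.dist(p, c) for c in centres)
    return total


# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]

# Two candidate solutions, as an optimizer might propose them.
good_centres = [(1.0, 1.5), (8.5, 8.0)]   # one centre per group
poor_centres = [(1.0, 1.0), (1.0, 2.0)]   # both centres in the same group

good_score = intra_cluster_distance(points, good_centres)
poor_score = intra_cluster_distance(points, poor_centres)
print(good_score, poor_score)
```

A lower score indicates more compact clusters, so an optimizer would prefer `good_centres` here. The example also shows why initialization matters: a local search started from `poor_centres` may need many iterations to escape.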

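Table 1. The various distance measures of clustering (x and y denote two n-dimensional data points with components x_i and y_i)

Distance Measure | Formula | Comments
Euclidean | $d(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ | Computes the square root of the sum of the squares of the differences between corresponding values.
City block | $d(x,y) = \sum_{i=1}^{n} |x_i - y_i|$ | Computes the sum of the absolute differences of the corresponding components.
Cosine | $d(x,y) = 1 - \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$ | Cosine similarity takes into account only the angle between the vectors and discards their magnitude; the distance is one minus that similarity.
Pearson correlation | $d(x,y) = 1 - \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$ | The Pearson correlation distance is one minus the correlation coefficient, so it is small when the two vectors are close to a linear relationship.
Minkowski distance | $d(x,y) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$ | Minkowski distance is a generalization of the Euclidean (p = 2) and Manhattan (p = 1) distances.
Manhattan distance | $d(x,y) = \sum_{i=1}^{n} |x_i - y_i|$ | Manhattan distance is measured along directions parallel to the coordinate axes; it is the same measure as the city block distance.
Chebychev distance | $d(x,y) = \max_{i} |x_i - y_i|$ | Chebychev distance simply picks the largest difference between any two corresponding coordinates.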
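
The measures in Table 1 can be written as short functions. This is a minimal sketch using the standard textbook definitions, with points given as equal-length lists of numbers; the function names and sample vectors are illustrative only, and the cosine and Pearson measures are expressed as distances (one minus the similarity or correlation).

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    # Also known as the city block distance.
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def minkowski(x, y, p):
    # p = 1 gives Manhattan, p = 2 gives Euclidean.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    # One minus the cosine of the angle; undefined for zero vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def pearson_distance(x, y):
    # One minus the Pearson correlation: cosine distance of the centred vectors.
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_distance([a - mean_x for a in x], [b - mean_y for b in y])


x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))
print(minkowski(x, y, 1), minkowski(x, y, 2))
print(cosine_distance(x, y), pearson_distance(x, y))
```

The choice of measure changes which points count as similar: Euclidean and Manhattan react to magnitude, whereas the cosine and Pearson distances compare only direction or linear trend.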
