Hamming Distance based Clustering Algorithm

Ritu Vijay (Banasthali University, India), Prerna Mahajan (Research Scholar, Banasthali University, India) and Rekha Kandwal (Ministry of Earth Sciences & Science and Technology, India)
Copyright: © 2012 | Pages: 10
DOI: 10.4018/ijirr.2012010102

Abstract

Cluster analysis has been extensively used in machine learning and data mining to discover distribution patterns in data. Clustering algorithms are generally based on a distance metric and partition the data into small groups such that instances in the same group are more similar to one another than to instances in different groups. In this paper the authors extend the concept of Hamming distance to categorical data. As a preprocessing step, they transform the data into a binary representation. The proposed algorithm is then used to group data points into clusters. Experiments are carried out on data sets from the UCI Machine Learning Repository to analyze its performance. The authors conclude that the proposed algorithm shows promising results and can be extended to handle numeric as well as mixed data.
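The two ingredients named in the abstract, a binary representation of categorical records and a Hamming distance over it, can be illustrated with a minimal sketch. The one-hot encoding, the function names `one_hot_encode` and `hamming`, and the toy records are assumptions made for illustration. The abstract does not specify the authors' exact encoding or clustering procedure, so this should not be read as their implementation.

```python
def one_hot_encode(records):
    """Encode categorical records as tuples of 0/1 bits (assumed one-hot scheme)."""
    n_attrs = len(records[0])
    # Sorted distinct values per attribute give a stable, reproducible bit layout.
    domains = [sorted({r[j] for r in records}) for j in range(n_attrs)]
    encoded = []
    for r in records:
        bits = []
        for j, domain in enumerate(domains):
            # One bit per possible value; exactly one bit per attribute is set.
            bits.extend(1 if r[j] == value else 0 for value in domain)
        encoded.append(tuple(bits))
    return encoded


def hamming(a, b):
    """Count the positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy data: (color, size)
records = [("red", "small"), ("red", "large"), ("blue", "large")]
encoded = one_hot_encode(records)
d01 = hamming(encoded[0], encoded[1])  # records differ in one attribute
d02 = hamming(encoded[0], encoded[2])  # records differ in both attributes
```

Under this particular encoding, each mismatched attribute flips two bits, so the Hamming distance equals twice the number of attributes on which two records disagree. A different binary encoding would weight mismatches differently.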
Article Preview

Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized (Chen, Han, & Yu, 1996). The discovered clusters can be used to explain the characteristics of the underlying data. Clustering has found many business applications: it can be used to identify different customer segments so that businesses can offer them customized solutions, or to predict customer buying patterns based on the properties of the cluster to which each customer belongs.

Many clustering algorithms exist for various types of target datasets. Most earlier clustering algorithms were designed for numerical data, whose inherent geometric properties can naturally be used to define a distance function between data points; examples include k-means, DBSCAN, CURE, and WaveCluster (MacQueen, 1967; Nanopoulos & Theodoridis, 2001; Ester et al., 1996; Zhang et al., 1996; Sheikholeslami et al., 1998). Most traditional clustering algorithms are limited in handling datasets that contain categorical attributes, because algorithms designed for numerical attributes do not work well on categorical attributes, which have different properties. A few algorithms have been proposed in recent years for clustering categorical data (Guha et al., 1998; Karypis et al., 1999; Huang, 1997; Zhang et al., 2000; Gibson et al., 1998). He et al. (2003) proposed a k-histogram algorithm for categorical data, which extends the k-means algorithm by replacing the cluster means with histograms and dynamically updating the histograms during the clustering process.
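The idea of replacing a cluster mean with a histogram can be sketched as follows. This is a loose illustration, not the algorithm of He et al. (2003): the class `ClusterHistogram`, its methods, and the frequency-based dissimilarity are assumptions chosen to show how per-attribute value counts can stand in for a centroid and be updated as records join a cluster.

```python
from collections import Counter


class ClusterHistogram:
    """Per-attribute value counts summarizing a cluster of categorical records."""

    def __init__(self, n_attrs):
        self.counts = [Counter() for _ in range(n_attrs)]
        self.size = 0

    def add(self, record):
        # Update the histogram incrementally when a record joins the cluster.
        for counter, value in zip(self.counts, record):
            counter[value] += 1
        self.size += 1

    def dissimilarity(self, record):
        # Assumed measure: for each attribute, 1 minus the relative frequency
        # of the record's value within the cluster, summed over attributes.
        if self.size == 0:
            return float(len(self.counts))
        return sum(
            1 - counter[value] / self.size
            for counter, value in zip(self.counts, record)
        )


# Hypothetical toy cluster of (color, size) records
cluster = ClusterHistogram(n_attrs=2)
cluster.add(("red", "small"))
cluster.add(("red", "large"))

near = cluster.dissimilarity(("red", "small"))
far = cluster.dissimilarity(("blue", "large"))
```

In a k-means-style loop, each record would be assigned to the cluster with the smallest dissimilarity and the histograms updated accordingly; the exact measure and update rule in the cited work may differ from this sketch.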
