Clustering in data mining is a discovery process that groups a set of data such that the intracluster similarity is maximized and the intercluster similarity is minimized (Chen, Han, & Yu, 1996). The discovered clusters can be used to explain the characteristics of the underlying data. Clustering has found many business applications: it can be used to identify different customer segments so that businesses can offer them customized solutions, or to predict customer buying patterns based on the properties of the cluster to which they belong.
Many clustering algorithms exist for various types of target datasets. Most earlier algorithms, such as k-means, DBSCAN, CURE, and WaveCluster, were designed for numerical data, whose inherent geometric properties can be used naturally to define a distance function between data points (Queen, 1967; Nanopoulos & Theodoridis, 2001; Ester et al., 1996; Zhang et al., 1996; Sheikholeslami et al., 1998). These traditional algorithms are limited in handling datasets that contain categorical attributes, because categorical values have no inherent ordering or geometric distance and so behave differently from numerical ones. A few algorithms have been proposed in recent years for clustering categorical data (Guha et al., 1998; Karypis et al., 1999; Huang, 1997; Zhang et al., 2000; Gibson et al., 1998). He et al. (2003) proposed a k-histogram algorithm for categorical data, which extends the k-means algorithm by replacing the means of clusters with histograms and dynamically updating the histograms during the clustering process.
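The idea of replacing cluster means with histograms can be illustrated with a minimal Python sketch. It is not the formulation published by He et al.; it only shows the general technique under stated assumptions. The function names (`dissimilarity`, `k_histograms`), the `init_indices` parameter, and the dissimilarity measure (for each attribute, one minus the relative frequency of the record's value in the cluster) are hypothetical choices made for this example.

```python
from collections import Counter


def dissimilarity(record, histograms, size):
    """Assumed measure: per attribute, 1 - relative frequency of the record's value."""
    if size == 0:
        return float(len(record))  # an empty cluster is maximally dissimilar
    return sum(1.0 - hist[value] / size for hist, value in zip(histograms, record))


def k_histograms(records, k, init_indices=None, max_iter=20):
    """Cluster categorical records, keeping one value histogram per attribute per cluster."""
    if init_indices is None:
        init_indices = list(range(k))  # assumption: seed clusters with the first k records
    n_attrs = len(records[0])
    # Each cluster starts as a single seed record.
    hists = [[Counter({records[s][j]: 1}) for j in range(n_attrs)] for s in init_indices]
    sizes = [1] * k
    labels = [None] * len(records)
    for cluster, s in enumerate(init_indices):
        labels[s] = cluster

    for _ in range(max_iter):
        moved = False
        for i, rec in enumerate(records):
            old = labels[i]
            best = min(range(k), key=lambda c: dissimilarity(rec, hists[c], sizes[c]))
            if best == old:
                continue
            # Update histograms immediately when a record changes cluster.
            if old is not None:
                for j, value in enumerate(rec):
                    hists[old][j][value] -= 1
                sizes[old] -= 1
            for j, value in enumerate(rec):
                hists[best][j][value] += 1
            sizes[best] += 1
            labels[i] = best
            moved = True
        if not moved:
            break  # no record changed cluster, so the assignment is stable
    return labels


# Toy data: two attributes per record, two visibly different groups.
records = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("b", "w")]
labels = k_histograms(records, k=2, init_indices=[0, 3])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

As with k-means, the result depends on the initial seeds, so a practical run would typically try several initializations and keep the best one.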