Introduction
Centroid-based clustering has a long history in numerical taxonomy and is one of the most heavily used techniques in exploratory data mining. Clustering (Jain, 2010) has become a common technique for statistical data analysis and is applied in many fields, such as machine learning, pattern recognition, information retrieval, image analysis, bioinformatics, computational finance, systems engineering, and social networking. Cluster analysis involves finding a specific number (K) of subgroups, known as clusters, within a set of N observations (data points, samples, or objects), where each sample S is described by D features; the clusters should exhibit high intra-cluster homogeneity and high inter-cluster dissimilarity. In centroid-based clustering, each cluster Ck is represented by a center vector of size D, which need not be a member of the observed data points. Each observation is assigned either to exactly one cluster (exclusive, crisp, or hard assignment) or partially to several clusters (partial, fuzzy, or soft assignment). Mathematically, a K-clustering of a data set X = {x1, ..., xN} is a partition of X into K sets (clusters) C1, ..., CK such that the following three conditions are met:
1. Ci ≠ ϕ, i = 1, ..., K
2. C1 ∪ C2 ∪ ... ∪ CK = X
3. Ci ∩ Cj = ϕ, i ≠ j, i, j = 1, ..., K
This is known as a hard clustering into K clusters. In fuzzy clustering, each data item is assigned a membership value in the interval [0, 1] for each cluster. Consequently, in fuzzy clustering with K clusters, the clusters may overlap, so that Ci ∩ Cj ≠ ϕ for i ≠ j, i, j = 1, ..., K (i.e., a sample point may belong to every cluster with a certain degree of membership).
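The hard-clustering conditions and the fuzzy membership range can be checked mechanically. The following is a minimal sketch, not part of the original paper: the function names, the sample identifiers, and the membership values are hypothetical. The coverage check (the clusters jointly contain every element of X) is assumed from the usual meaning of "partition".

```python
from itertools import combinations


def is_hard_partition(data, clusters):
    """Check the hard-clustering conditions for a list of clusters (Python sets)."""
    # Every cluster is non-empty.
    non_empty = all(len(c) > 0 for c in clusters)
    # The clusters jointly cover the data set (assumed from the definition of a partition).
    covers_data = set().union(*clusters) == set(data)
    # No two distinct clusters share an element.
    disjoint = all(a.isdisjoint(b) for a, b in combinations(clusters, 2))
    return non_empty and covers_data and disjoint


def is_valid_membership(memberships):
    """Check that every fuzzy membership degree lies in the interval [0, 1]."""
    return all(0.0 <= u <= 1.0 for row in memberships for u in row)


# Hypothetical data: six sample identifiers and K = 2 clusters.
data = ["x1", "x2", "x3", "x4", "x5", "x6"]
hard = [{"x1", "x2", "x3"}, {"x4", "x5", "x6"}]
overlapping = [{"x1", "x2", "x3"}, {"x3", "x4", "x5", "x6"}]

# Hypothetical fuzzy memberships: one row per sample, one column per cluster.
memberships = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.2, 0.8], [0.1, 0.9], [0.0, 1.0]]

print(is_hard_partition(data, hard))         # True
print(is_hard_partition(data, overlapping))  # False: x3 belongs to both clusters
print(is_valid_membership(memberships))      # True
```

In the overlapping example, x3 belongs to both clusters, which violates the disjointness condition of hard clustering but is exactly the situation that fuzzy memberships are meant to describe.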
Centroid-based clusters can be generated by several well-known techniques, such as K-means, which represents each cluster by a single mean vector (Hartigan & Wong, 1979; Jain, 2010; Krishna & Narasimha Murty, 1999); K-medoids, which restricts the centroids to members of the data set (Park & Jun, 2009; Rousseeuw & Kaufman, 1990); K-medians, which chooses medians as cluster centers (Anderson et al., 2008); K-means++, which chooses the initial centers less randomly (Arthur & Vassilvitskii, 2007); and fuzzy K-means, which allows a fuzzy cluster assignment (De Oliveira & Pedrycz, 2007; Kruse et al., 2007).
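To make the idea of a mean-vector centroid concrete, the sketch below implements a plain Lloyd-style K-means iteration with hard assignment. It is an illustration under stated assumptions rather than the method of any cited work: it is not the Hartigan and Wong variant, it uses simple random initialization instead of K-means++ seeding, and the function name, parameters, and sample points are hypothetical.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd-style K-means: alternate hard assignment and mean update."""
    rng = random.Random(seed)
    # Plain random initialization from the data (assumption; not K-means++).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: each centroid becomes the mean vector of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

The resulting centroids are mean vectors and need not coincide with any observed point. Restricting each centroid to a member of its cluster would give a K-medoids-style variant instead.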
Clustering techniques are considered effective knowledge exploration techniques, but understanding the relation between the generated clusters and their data points is equally important. Real-life data are most often represented in a high-dimensional space, so the inherent similarities are hard to recognize and illustrate. This makes it challenging to build tools that can visualize the similarities and relationships between features. By nature, visualization requires a mapping from a high-dimensional input space to a low-dimensional output space.
In this paper, existing cluster visualization and knowledge exploration techniques are presented first; we then introduce a new visualization technique (MVClustViz) built on the traditional bar visualization. MVClustViz is capable of visualizing large-scale, multidimensional datasets in a single view and can produce a quick overview of the dataset. The technique is also capable of visualizing the complete information of overlapping clusters. Further, fuzzy clusters can be visualized with this technique without a predefined fuzzy cut-off value, which increases the flexibility of the analysis. We discuss different visualization possibilities for MVClustViz and, lastly, a few results of interpolation techniques.