A well known issue with prototype-based clustering is the user’s obligation to know the right number of clusters in a dataset in advance or to determine it as a part of the data analysis process. There are different approaches to cope with this non-trivial problem. This chapter follows the approach to address this problem as an integrated part of the clustering process. An extension to repulsive fuzzy c-means clustering is proposed equipping non-Euclidean prototypes with repulsive properties. Experimental results are presented that demonstrate the feasibility of the authors’ technique.
TopIntroduction
Clustering has become a very popular technique to discover interesting patterns in complex data. Due to its clear output, results are easily interpretable by all audience. It is thus not surprising that clustering is applied in various domains, e.g. the analysis of gene expression data, customer behavior, air traffic management and many more (Raytchev and Murase, 2001, Frigui and Krishnapuram, 1999; Ressom et al., 2003; Rehm and Klawonn, 2005). The purpose of clustering is to divide a dataset into different groups or clusters such that elements of the same cluster are as similar as possible and elements of different clusters are as dissimilar as possible (Duda and Hart, 1973; Bezdek, 1981). It is generally applied to data where no class labels are assigned to the single entities. In fact, the intention of using clustering is to gain this class information as a result of the clustering process. It is therefore known as unsupervised classification.
Most clustering algorithms can be categorized into hierarchical clustering and partitional clustering. Hierarchical clustering groups data over a variety of scales by constructing a cluster tree. This tree represents a multilevel hierarchy, where clusters at one level are joined as clusters at the next level (Duda and Hart, 1973). This allows to decide the scale of clustering that is most appropriate for the respective application. Hierarchical clustering either builds a hierarchy of clusters bottom-up (agglomerative), starting with each sample as a cluster and forming a sequence by successively merging clusters, or splits clusters top-down (divisive), starting with all samples in on cluster and successively separating the data and forming a sequence of partitions (Duda and Hart, 1973). Partitional clustering attempts to directly decompose the dataset into a set of disjoint clusters that ideally comply with the natural grouping present in the data.
Prototype-based clustering algorithms represent the most popular class of partitional clustering techniques. The nature of prototype-based clustering is that, as a result, some representatives, the so-called prototypes, typify a subset of data objects by its position in the center of the respective data cluster. Typically, the number of data clusters is not known in advance but must be specified when applying prototype-based clustering algorithms. In fact, the determination of the exact number of clusters is a difficult problem. Most clustering algorithms can partition a dataset into any specified number of clusters even if the data contain no cluster structure (Jain and Moreau, 1987). Numerous cluster validity measures, procedures for determining the number of clusters, have been proposed. Global cluster validity measures mostly utilize a kind of square error criterion and condense the clustering result to a scalar value after the clustering process which is associated with a huge loss of information. Local cluster validity measures try to estimate the optimal number of clusters as an integrated part the clustering process. These techniques mostly over specify the number of clusters for the initial partition and the final one has the optimal number of clusters (Timm et al., 2001; Krishnapuram and Freg, 1992; Xiong et al., 2004). Another approach to assess cluster validity is to visualize the resulting cluster partition and inspect it visually (Hathaway and Bezdek, 2003; Hathaway et al., 2006; Havens et al., 2008; Klawonn et al., 2003; Rehm et al., 2006). Mostly, several runs with various parameter sets must be performed in order to find a suitable solution. Besides that, initialization may have a considerable impact on the clustering result. Unfortunately, no holistic solution for these problems can be provided until now. However, if certain knowledge about the data is available, e.g. what will be the approximate size of the clusters and how far are they separated, clustering algorithms can use these information to reduce user load doing expert working, e.g. in finding parameters, and finally improve clustering results.