MOSAIC: Agglomerative Clustering with Gabriel Graphs

MOSAIC: Agglomerative Clustering with Gabriel Graphs

Rachsuda Jiamthapthaksin (University of Houston, USA), Jiyeon Choo (University of Houston, USA), Chun-sheng Chen (University of Houston, USA), Oner Ulvi Celepcikay (University of Houston, USA), Christian Giusti (University of Udine, Italy) and Christoph F. Eick (University of Houston, USA)
DOI: 10.4018/978-1-60566-748-5.ch010
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

Strong theoretical foundation and low computational complexity make representative-based clustering one of the most popular approaches for a clustering problem. Despite those superiorities, it presents two main drawbacks: the shape of clusters obtained is limited to convex shapes, and its performance is highly dependent on seeds initialization. To address these problems, the authors introduce MOSAIC, a novel agglomerative clustering algorithm, which greedily merges neighboring clusters maximizing a plug-in fitness function. The key idea is that by considering neighboring relationship computed using Gabriel Graphs among cluster, MOSAIC can derive non-convex shapes as the unions of small clusters previously generated by a representative-based clustering algorithm. The authors evaluate MOSAIC for traditional unsupervised clustering with k-means and DBSCAN, and also for supervised clustering. The experimental results show that compared to k-means stand-alone, their proposed post-processing techniques obtain higher quality clusters, whereas compared to DBSCAN results, MOSAIC is capable of identifying comparable arbitrary shape clusters, given a suitable fitness function. In addition, MOSAIC can cope with problems of clustering on high dimensional data. The authors also claim that MOSAIC can be employed as an effective post-processing clustering algorithm to further improve the quality of clustering.
Chapter Preview
Top

1 Introduction

Representative-based clustering methods work by initially selecting a set of representatives and assigning each object of the dataset to the closest representative. The most popular representative-based clustering algorithm is k-means, which uses centroids of clusters as representatives and iteratively updates both clusters and centroids until no change occurs and the best local configuration is reached. k-means is widely used because of its low computational complexity, O(ktn) where k is the number of clusters, t is the number of iterations and n is the number of objects in the dataset. In spite of its fast computation, when using k-means we have to deal with three main problems. First, the number of clusters k has to be known prior to being used as a parameter. Second, the clusters obtained by using this technique are always convex and it is not possible to get good results when dealing with non convex shapes (Jiang, 2004) and third, the method is very sensitive to the initialization of representatives and also to the outliers.

Agglomerative hierarchical clustering (AHC) is one of the techniques capable of detecting clusters of arbitrary shapes (Tan, 2005, pp. 516-520). It merges the two closest clusters; because it does not consider any other merging candidates, the search space is very narrow, which results in missing many high quality solutions. Moreover, the computational complexity of this technique is O(n*n) or worse, therefore it is not applicable to very large datasets.

In this paper, we propose a new technique combining advantages of the representative-based clustering and agglomerative clustering. This hybrid clustering algorithm makes use of an externally given fitness function in order to greedily merge neighboring clusters. The neighboring relationship is computed by using Gabriel Graphs (Gabriel, 1969), one of the most popular proximity graphs, and non-convex shapes are created by merging a set of small convex clusters that have been generated using a representative-based clustering algorithm, as illustrated in Fig. 1. The art of creating mosaics refers to the procedure that assembles small tiles in order to create a sophisticated drawing; in the same way our method, called MOSAIC, merges small convex clusters together in order to obtain better clusters having a non-convex shape where needed.

Figure 1.

An illustration of MOSAIC’s approach © 2007 Springer. Used with permission.

By using proximity graphs, MOSAIC is able to conduct a very wide search, which in turn results in obtaining higher quality clusters. Moreover, the computationally expensive agglomerative clustering step is not run for the whole dataset (n), but only for the set of representatives (k) where k<<n. In this way, the agglomerative technique is executed for usually less than 1,000 iterations, reducing the overall computational complexity. Our technique is highly generic and uses the following components: the representative-based clustering, the proximity graph and the cluster evaluation function, which can be replaced by any similar method.

Fig. 2 gives the pseudo code of the proposed MOSAIC algorithm. First MOSAIC obtains a set of cluster representatives from a representative-based clustering algorithm. Next, it finds a merge-candidate relation by using a proximity graph. After that, MOSAIC iteratively merges a pair of merging candidates, which maximizes the given fitness function q, and then incrementally updates the merge-candidate relation. The algorithm terminates if there is no merging candidate left, and returns the best clustering found.

Figure 2.

Pseudo code for MOSAIC © 2007 Springer. Used with permission.

Complete Chapter List

Search this Book:
Reset