Multi-Attribute Utility Theory Based K-Means Clustering Applications

Multi-Attribute Utility Theory Based K-Means Clustering Applications

Jungmok Ma (Korea National Defense University, Seoul, South Korea)
Copyright: © 2017 |Pages: 12
DOI: 10.4018/IJDWM.2017040101

Abstract

One of major obstacles in the application of the k-means clustering algorithm is the selection of the number of clusters k. The multi-attribute utility theory (MAUT)-based k-means clustering algorithm is proposed to tackle the problem by incorporating user preferences. Using MAUT, the decision maker's value structure for the number of clusters and other attributes can be quantitatively modeled, and it can be used as an objective function of the k-means. A target clustering problem for military targeting process is used to demonstrate the MAUT-based k-means and provide a comparative study. The result shows that the existing clustering algorithms do not necessarily reflect user preferences while the MAUT-based k-means provides a systematic framework of preferences modeling in cluster analysis.
Article Preview

Introduction

Cluster analysis (or clustering) is an exploration of data to find similar groups or clusters of data objects. This fundamental problem arises not only in many applications including engineering, marketing, military, finance, biology, and linguistics but also in basic data pre-processing of any applications. Since there is no class label (predefined names of groups), clustering is also called unsupervised learning in contrast with regression and classification in supervised learning. There are various types of clustering: hierarchical versus partitional, exclusive versus overlapping versus fuzzy, and complete versus partial (Tan, Steinbach, & Kumar, 2005). Jain et al. (1999) overview well-known clustering techniques (e.g., hierarchical, k-means, nearest neighbor, and fuzzy clustering) and similarity measures (e.g., Euclidean, Mahalanobis, and mutual neighbor distance) along with the history of clustering.

The k-means clustering algorithm (MacQueen, 1967) is widely used and popular since it is simple and efficient. The k-means has advantages in applications with large data sets over hierarchical clustering as a partitional type. Its time complexity is linear in n or O(n), where n is the number of data points (Jain, Murty, & Kumar, 1999). Major disadvantages with this algorithm include 1) the number of clusters (called k) should be provided by users and 2) the algorithm converges to local minima. In regard to the local minima, when the k-means is implemented, initial starting points can affect the clustering result. Though several approaches are developed (e.g., k-means++ (Arthur & Vassilvitskii, 2007), global k-means (Likas, Vlassis, & Verbeek, 2007), approaches based on simulated annealing, genetic algorithm, and tabu search (Jain et al., 1999), etc.), multiple restarts with different starting points are still widely used (Jain et al., 1999).

The main focus in this paper is how to select the number of clusters. Generally, the assumption of knowing the right number of clusters is a big assumption and rarely satisfied in real practice. Moreover, the concept of the right number is not clear and frequently dependent on users.

In the literature, several approaches are discussed to determine the number of clusters (k) (Milligan & Cooper, 1983; Hardy, 1996; Pham, Dimov, & Nguyen, 2004; Chiang & Mirkin, 2010). First, a reasonable range of k can be tested. If there is a point where a significant change of the objective function value (also called a cost function) in the k-means happens while increasing k, the k can be chosen (the shape looks like an elbow). Second, information criteria such as the Bayesian information criterion (BIC) or Akaike information criterion (AIC) can be used as an objective function. The X-means clustering algorithm (Pelleg & Moore, 2000) extends the k-means by adopting the BIC. Third, other statistics are proposed for the estimation of k: Gap statistic (Tibshirani, Walther, & Hastie, 2001), Jump statistic (Sugar & James, 2003), Silhouette width statistic (Rousseuw, 1987; Lleti, Ortiz, Sarabia, & Sánchez, 2004), etc. While these approaches provide criteria to choose k in the k-means based on the statistical properties of data, user preferences are rarely considered in cluster analysis.

Complete Article List

Search this Journal:
Reset
Open Access Articles: Forthcoming
Volume 15: 4 Issues (2019): Forthcoming, Available for Pre-Order
Volume 14: 4 Issues (2018): 2 Released, 2 Forthcoming
Volume 13: 4 Issues (2017)
Volume 12: 4 Issues (2016)
Volume 11: 4 Issues (2015)
Volume 10: 4 Issues (2014)
Volume 9: 4 Issues (2013)
Volume 8: 4 Issues (2012)
Volume 7: 4 Issues (2011)
Volume 6: 4 Issues (2010)
Volume 5: 4 Issues (2009)
Volume 4: 4 Issues (2008)
Volume 3: 4 Issues (2007)
Volume 2: 4 Issues (2006)
Volume 1: 4 Issues (2005)
View Complete Journal Contents Listing