Introduction
Cluster analysis (or clustering) explores data to find groups, or clusters, of similar data objects. This fundamental problem arises not only in many application areas, including engineering, marketing, military, finance, biology, and linguistics, but also in the basic data pre-processing of many applications. Since no class labels (predefined names of groups) are given, clustering is also called unsupervised learning, in contrast to regression and classification in supervised learning. There are various types of clustering: hierarchical versus partitional, exclusive versus overlapping versus fuzzy, and complete versus partial (Tan, Steinbach, & Kumar, 2005). Jain et al. (1999) provide an overview of well-known clustering techniques (e.g., hierarchical, k-means, nearest neighbor, and fuzzy clustering) and similarity measures (e.g., Euclidean, Mahalanobis, and mutual neighbor distance), along with the history of clustering.
The k-means clustering algorithm (MacQueen, 1967) is widely used and popular because it is simple and efficient. As a partitional method, k-means has an advantage over hierarchical clustering in applications with large data sets: its time complexity is linear in n, i.e., O(n), where n is the number of data points (Jain, Murty, & Flynn, 1999). Its major disadvantages are that 1) the number of clusters (called k) must be provided by the user and 2) the algorithm converges only to a local minimum. Regarding the local minima, the choice of initial starting points can affect the clustering result when k-means is run. Although several approaches have been developed to address this (e.g., k-means++ (Arthur & Vassilvitskii, 2007), the global k-means (Likas, Vlassis, & Verbeek, 2007), and approaches based on simulated annealing, genetic algorithms, and tabu search (Jain et al., 1999)), multiple restarts with different starting points are still widely used (Jain et al., 1999).
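As a rough sketch of this common multiple-restart practice, the following minimal Lloyd's-style k-means keeps the run with the lowest cost over several random initializations. The function names, defaults, and stopping rule are illustrative assumptions, not taken from any of the cited papers:

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """One run of Lloyd's k-means; returns (centers, labels, cost)."""
    rng = np.random.default_rng(rng)
    # Initialize centers by sampling k distinct data points (random restart style).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        # by squared Euclidean distance.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points;
        # an empty cluster keeps its old center.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    # Cost: within-cluster sum of squares (the k-means objective).
    cost = ((X - centers[labels]) ** 2).sum()
    return centers, labels, cost

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Run k-means from several random starting points; keep the lowest cost."""
    best = None
    for r in range(n_restarts):
        result = kmeans(X, k, rng=seed + r)
        if best is None or result[2] < best[2]:
            best = result
    return best
```

Because each restart may converge to a different local minimum, the best-of-several strategy above trades extra computation for robustness to poor initializations.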
The main focus of this paper is how to select the number of clusters. In practice, assuming that the right number of clusters is known is a strong assumption that is rarely satisfied. Moreover, the notion of the "right" number is itself unclear and frequently depends on the user.
In the literature, several approaches have been discussed for determining the number of clusters (k) (Milligan & Cooper, 1983; Hardy, 1996; Pham, Dimov, & Nguyen, 2004; Chiang & Mirkin, 2010). First, a reasonable range of k can be tested: if the objective function value (also called the cost function) of k-means drops sharply at some k and flattens afterward as k increases, that k can be chosen (the curve looks like an elbow). Second, information criteria such as the Bayesian information criterion (BIC) or the Akaike information criterion (AIC) can be used as an objective function; the X-means clustering algorithm (Pelleg & Moore, 2000) extends k-means by adopting the BIC. Third, other statistics have been proposed for estimating k: the Gap statistic (Tibshirani, Walther, & Hastie, 2001), the Jump statistic (Sugar & James, 2003), the Silhouette width statistic (Rousseeuw, 1987; Lleti, Ortiz, Sarabia, & Sánchez, 2004), etc. While these approaches provide criteria for choosing k in k-means based on the statistical properties of the data, user preferences are rarely considered in cluster analysis.
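The first approach (the elbow heuristic) can be sketched as follows. This is a minimal illustration, not any of the cited methods: the compact k-means inside and the second-difference rule used to locate the bend of the cost curve are assumptions chosen for simplicity.

```python
import numpy as np

def _kmeans_cost(X, k, rng, n_iter=100):
    """One compact Lloyd's k-means run; returns the within-cluster sum of squares."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return ((X - centers[labels]) ** 2).sum()

def elbow_k(X, k_max=10, n_restarts=5, seed=0):
    """Evaluate the k-means cost for k = 1..k_max (best of a few restarts)
    and place the elbow at the sharpest bend of the cost curve, taken here
    as the maximum second difference -- one simple way to formalize the rule."""
    rng = np.random.default_rng(seed)
    costs = [min(_kmeans_cost(X, k, rng) for _ in range(n_restarts))
             for k in range(1, k_max + 1)]
    # np.diff(costs, 2)[i] is the second difference at k = i + 2.
    k_elbow = int(np.argmax(np.diff(costs, 2))) + 2
    return k_elbow, costs
```

In practice the cost curve is often inspected visually rather than thresholded automatically, since noise in the restarts can blur the bend; the BIC- and statistic-based approaches above replace this visual judgment with an explicit criterion.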