Article Preview
TopIntroduction
Data Mining is a process of discovering new, unexpected, valuable patterns from existing databases (Agrawal, Mannila, Srikant, Toivonen, & Verkamo, 1996; Chen, Han, & Yu, 1996). Data mining is finding increasing popularity in science and practice areas that need to analyze large amounts of data to discover patterns and trends, in which they could not otherwise find. Different applications may require different data mining techniques. The kinds of knowledge that could be discovered from a database are categorized into association rules mining, sequential patterns mining, classification and clustering (Daly & Taniar, 2004).
Clustering is the process of grouping a data set in such a way that the similarity between data within a cluster is maximized while the similarity between data of different clusters is minimized (Kwok, Smith, Lozano, & Taniar, 2002).
Clustering methods partition a set of objective into subsets such that objects in the same subsets are more similar to each other than objects in different ones according to some defined criteria (Ng & Wong, 2002). Clustering task does not try to classify, estimate, or predict the value of a target variable. Instead, they seek to segment the entire data set into comparatively homogeneous clusters. The similarity of records inside each cluster is maximized and the similarity of records between clusters is minimized (Larose, 2004).
Resulting groups or clusters lead the end user to estimate great data values (Berson & Smith, 2002). One of the main advantages of clustering technique is that there is no need of basic knowledge regarding to data and we only have to choose a good index for measuring. Another advantage of this technique is to be appropriate for any kind of data by choosing a suitable criterion (Jain, Murty, & Flynn, 1999).
Clustering has been effectively applied in a variety of anthropology, biology, medicine, psychology, statistics, mathematics, engineering, and computer science (Celebi, Kingravi, & Vela, 2013). It is often included as first step in data mining process and the results of clustering are used in the other models as inputs (Hand, Mannila, & Smyth, 2005). It should be considered that due to different aspects, clustering of observations is a difficult problem to solve and categorized in class of NP-complete problems (Güngör & Ünler, 2007).
Clustering algorithms can be broadly classified into two groups: hierarchical and partitional (Jain, 2010). Hierarchical algorithms recursively find nested clusters either in a top-down (divisive) or bottom-up (agglomerative) fashion. In contrast, partitional algorithms find all the clusters simultaneously as a partition of the data and do not impose a hierarchical structure (Celebi, et al., 2013). In other words, partitional clustering algorithms divide the data set into non-overlapping groups (Jain, et al., 1999). Algorithms such as k-mean, bisecting k-mean, and k-modes belong to this category. Furthermore, some other bio-inspired algorithms like Artificial Neural Networks (ANNs), heuristics and Evolutionary Algorithms (EAs) are frequently applied for clustering problems.
K-means as one of the most famous clustering algorithm is a straightforward and effective way for finding clusters in data (Jain, 2010). The popularity of K-means can be attributed to several reasons. First, it is conceptually simple and easy to implement. Almost all data mining software includes an implementation of it. Second, it is versatile; meanwhile almost every aspect of the algorithm (initialization, distance function, termination criterion, etc.) can be modified. Third, it has a time complexity that is linear in N, D, and K (in general, D_N and K_N). For this reason, it can be used to initialize more expensive clustering algorithms. Fourth, it has a storage complexity that is linear in N, D, and K. Fifth, it is guaranteed to converge at a quadratic rate. Finally, it is invariant to data ordering, i.e. random shuffling of the data points (Celebi, et al., 2013).