Data clustering is a discovery process that partitions a data set into groups (clusters) such that data points within the same group are highly similar to one another and very dissimilar to points in other groups (Han & Kamber, 2001). The ultimate goal of data clustering is to discover natural groupings in a set of patterns, points, or objects without prior knowledge of any class labels. In the machine-learning literature, data clustering is therefore typically regarded as a form of unsupervised learning, as opposed to supervised learning: there are no labeled training examples from which a target function can be learned. Applications of data clustering include, but are not limited to, pattern recognition, data analysis, data compression, image processing, understanding genomic data, and market-basket research.
Key Terms in this Chapter
Data Clustering: Data clustering is a discovery process that partitions a data set into groups such that data points within a group are highly similar to one another but very dissimilar to points in other groups.
Data Mining: Data mining is a knowledge discovery process that focuses on extracting previously unknown, actionable information from very large databases.
Unsupervised Learning: Unsupervised learning is a machine-learning approach in which a model is fit to a given set of observations. It is distinguished from supervised learning by the fact that there is no a priori output, that is, no known target labels for the observations.
Cluster: A cluster is a group of objects having some common natural characteristics.
Hybrid Clustering: Hybrid clustering is a clustering process that partitions a data set into preliminary clusters and then constructs a hierarchical structure upon these subclusters based on a given similarity measure.
Hierarchical Clustering: Hierarchical clustering is the process of creating a hierarchical decomposition of a data set.
Distance Measure: A distance measure is a metric used to compute the distance between two data objects. The most commonly used distance measures are the Manhattan distance and the Euclidean distance.
Partitioning Clustering: Partitioning clustering is the process of generating a single partition of the data in an attempt to recover any natural groupings hidden in the data.
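To make the terms "distance measure" and "partitioning clustering" concrete, the following is a minimal Python sketch, not a method taken from this chapter. It computes the Manhattan and Euclidean distances between two points and then produces a single partition of a small data set. The partitioning routine is a simplified k-means, chosen here only as one familiar example of partitioning clustering; the function names (manhattan, euclidean, kmeans) and the choice of the first k points as initial centroids are illustrative assumptions.

```python
import math


def manhattan(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=100):
    """Simplified k-means (illustrative): returns one cluster index per point.

    Assumption: the first k points serve as initial centroids, which keeps
    the sketch deterministic; practical implementations usually randomize.
    """
    centroids = list(points[:k])
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


print(manhattan((0, 0), (3, 4)))  # 7
print(euclidean((0, 0), (3, 4)))  # 5.0

# Two visibly separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(data, 2))  # [0, 0, 0, 1, 1, 1]
```

The sketch uses the Euclidean distance inside kmeans because the mean-based centroid update is the natural companion of that measure; substituting the Manhattan distance would normally call for a median-based update instead.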