Slawomir T. Wierzchon (Polish Academy of Sciences, Poland & University of Gdansk, Poland)

Copyright: © 2018
|Pages: 12

DOI: 10.4018/978-1-5225-2255-3.ch170

Chapter Preview

TopClustering is an exploratory activity relying upon dividing a given collection *X* of objects, or entities, into a set of categories, called groups or clusters, in such a way that any two objects placed in the same group have more in common than any two objects assigned to different groups. Consensus clustering has been proposed to overcome some drawbacks of individual clustering algorithms. Usually we assume that the clusters are disjoint subsets of *X* such that the objects belonging to a single cluster are sufficiently similar to each other (i.e. the clusters should be homogeneous), while objects from different clusters should be sufficiently diversified (i.e. clusters should be well separated). Splitting given collection into disjoint clusters is termed hard clustering. Otherwise we say about soft clustering, i.e. – depending on the formalism used – probabilistic or fuzzy clustering.

The most popular clustering algorithm is the *k*-means algorithm producing hard partitions – consult (Jain, 2010) for historical background and deeper discussion of its current improvements and variations. Soft version of the algorithm, called fuzzy *c*-means, was proposed by Bezdek (1981). This author used letter *c* to name the number of clusters, hence the name of the algorithm. Both the algorithms minimize the squared-error criteria. They are computationally efficient and do not require the user to specify many parameters. However, there are three main disadvantages of both the algorithms. First, they require that the entities must be represented as points in *n*-dimensional Euclidean space. To alleviate this assumption Hathaway, Davenport and Bezdek (1989) introduced relational version of fuzzy *c*-means algorithm: instead of the distance between the points representing the objects, a similarity measure between all pair of objects was used. In case of hard partitions the *k*-medoids algorithm was proposed: here a dissimilarity measure between pairs of objects replaces the distance measure – see e.g. Section 14.3.10 in (Hastie, Tibshirani, & Friedman, 2009). The second disadvantage results from the way in which objects are assigned to the clusters. Namely, in case of hard *k*-means each object is located to the cluster with nearest centroid (empirical mean of the cluster). Thus resulting clusters are spherical (more precisely, they are Voronoi regions). A similar assignment rule is used in the fuzzy *c*-means algorithm. Third disadvantage is that, the clusters should be of approximately similar cardinality and of similar shape. In case of unbalanced clusters erroneous results are frequently obtained. Similarly, if one cluster is located within a ball of small radius and the second – within an ellipsoid with one axis much greater than others, we can obtain erroneous results. Examples of “easy” and “difficult” data are depicted in Figure 1. Left panel presents compact, well separated, convex and linearly separated clusters, while non-convex clusters that are not linearly separated are shown in right panel.

To avoid these disadvantages, the ideas of ensemble methods used by machine learning community to improve results of classification methods, have been adapted to the requirements of clustering. In general, the ensemble methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models. A nice overview of these methods used in machine learning can be found e.g. in (Zhou, 2012). When transposed to the field of unsupervised learning (i.e. clustering) this idea translates to collecting multiple partitions of the same data. By combining these partitions, it is possible to obtain a good data partition even when the clusters are not compact and/or not well separated (Jain, 2010). Consensus clustering seems to be especially recommendable to analyze huge datasets. As noted in (Hore, Hall, & Goldgof, 2009): “The advantage of these approaches is that they provide a final partition of data that is comparable to the best existing approaches, yet scale to extremely large datasets. They can be 100,000 times faster while using much less memory.”

Supervised Learning: The problem of supervised learning is formulated as follows: Let X = { x 1 , …, x m } be a set of m objects or entities. Suppose these objects come from k classes and the membership to each object to exactly one class is known in advance. Then such a set is termed a training set T . It has the form T = {( x 1 , l 1 ) …, ( x m , l m )} where l i denotes the label of a class to which i -th object belongs. The aim of supervised learning is to find a mapping f : X ? L , such that f ( x i ) = l i , for all i = 1, …, m . The function f is said to be decision function or decision rule. If the labels l i are not known we say about unsupervised learning. Its task is to discover a structure in the set X .

Voronoi Regions: Suppose that the objects from the set X are represented as points in n -dimensional Euclidean space R n , and stands for the centroid (empirical mean) of i -th cluster C i . The Voronoi region associated with this centroid is the set of all points in R n whose distance to is not greater than their distance to the other centroids .

NP-Hard Problems: A notion used in computational complexity theory. It denotes a class of problems that are, informally, “at least as hard as the hardest problems in NP”. The shorthand NP means “non-deterministic polynomial-time.”

Cluster Analysis: It is an instance of unsupervised learning. In practice cluster analysis reduces to partitioning a given data set into an appropriate number of meaningful groups. However, the very nature of clustering is weakly understood. Consult e.g. the papers collected by Ben-David, et al . (2009) , or Henning (2015) AU79: The in-text citation "Henning (2015)" is not in the reference list. Please correct the citation, add the reference to the list, or delete the citation. .

Entropy: In information theory this notion, introduced by Claude Shannon, is used to express unpredictability of information content. For instance, if a data set containing n items was divided into k groups each comprising n i items, then the entropy of such a partition is H = p 1 log( p 1 ) + … + p k log( p k ), where p i = n i / n . In case of two alternative partitions, the mutual information is a measure of the mutual dependence between these partitions.

Unsupervised Learning: Contrary to the supervised case we have only the data set X = { x 1 , …, x m } of m objects or entities. The aim of unsupervised learning is to find inner structure in this set, or equivalently, to find some regularities in this set. Clustering is a possible tool used to extract such information, but other approaches (e.g. self organizing maps) may be used as well. If the labels l i are known for a (usually small) subset of X we are faced with semi-supervised learning.

Search this Book:

Reset