Particle Swarm Optimizer for High-Dimensional Data Clustering

Yanping Lu (Xiamen University, China) and Shaozi Li (Xiamen University, China)
Copyright: © 2011 |Pages: 21
DOI: 10.4018/978-1-61692-797-4.ch002


This chapter aims at developing effective particle swarm optimization (PSO) algorithms for two problems commonly encountered in high-dimensional data clustering: the variable weighting problem in soft projected clustering when the number of clusters k is known, and the problem of automatically determining the number of clusters k. Each problem is formulated as the minimization of a nonlinear continuous objective function subject to bound constraints. Special encoding schemes and search strategies are also proposed to tailor PSO to these two problems. Experimental results on both synthetic and real high-dimensional data show that the two proposed algorithms greatly improve cluster quality, and that their results are much less dependent on the initial cluster centroids. These results indicate the promising potential of PSO for clustering high-dimensional data.
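To make the abstract's formulation concrete, the following is a minimal sketch of PSO applied to a bound-constrained continuous minimization problem, which is the shape of problem the chapter describes. It is an illustration only: all parameter values (inertia `w`, acceleration coefficients `c1`/`c2`, swarm size, iteration count) and the sphere test function are generic defaults, not the chapter's actual settings.

```python
import numpy as np

# Minimal PSO for minimizing a continuous function f over the box [lo, hi];
# parameter values are illustrative defaults, not the chapter's settings.
def pso_minimize(f, lo, hi, n_particles=30, n_iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    dim = lo.size
    x = rng.uniform(lo, hi, (n_particles, dim))   # particle positions
    v = np.zeros((n_particles, dim))              # particle velocities
    pbest = x.copy()                              # personal best positions
    pbest_val = np.array([f(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()          # global best position
    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # standard velocity update: inertia + cognitive + social terms
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)                # enforce the bound constraints
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

# Usage: minimize a simple sphere function over [-5, 5]^10.
best_x, best_f = pso_minimize(lambda p: np.sum(p ** 2),
                              lo=np.full(10, -5.0), hi=np.full(10, 5.0))
print(best_f)
```

The `np.clip` call is the simplest way to honor the bound constraints mentioned in the abstract; the chapter's tailored encodings and search strategies would replace this generic scheme.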
Chapter Preview

1. Introduction

Clustering high-dimensional data is a common but important task in various data mining applications. A fundamental starting point for data mining is the assumption that a data object can be represented as a high-dimensional feature vector. Text clustering is a typical example. In text mining, a text data set is viewed as a matrix, in which a row represents a document and a column represents a unique term. The number of dimensions corresponds to the number of unique terms, which is usually in the hundreds or thousands. Another application where high-dimensional data occurs is insurance company customer prediction. It is important to separate potential customers into groups to help companies predict who would be interested in buying an insurance policy. Many other applications such as bankruptcy prediction, web mining, protein function prediction, etc. present similar data analysis problems.

Clustering high-dimensional data is a difficult task because clusters of high-dimensional data are usually embedded in lower-dimensional subspaces, and the feature subspaces for different clusters can overlap. In a text data set, documents related to a particular topic are characterized by one subset of terms. For example, a group of documents is categorized under the topic electronics because they contain a subset of terms such as electronics, signal, circuit, etc. The terms describing another topic, athletics, may not occur in the documents on electronics but will occur in the documents relating to sports.

Traditional clustering algorithms struggle with high-dimensional data because the quality of their results deteriorates due to the curse of dimensionality. As the number of dimensions increases, data becomes very sparse and distance measures over the whole dimension space become meaningless. Irrelevant dimensions spread out the data points until, in very high dimensions, they are almost equidistant from each other. The phenomenon is exacerbated when objects are related in different ways in different feature subsets. In fact, some dimensions may be irrelevant or redundant for certain clusters, and different sets of dimensions may be relevant for different clusters. Thus, clusters should often be searched for in subspaces of dimensions rather than in the whole dimension space.
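The distance-concentration effect described above can be observed numerically. The short experiment below (an illustration, not from the chapter) draws uniform random points and measures the relative contrast between the nearest and farthest neighbor of one point; as the dimensionality grows, the contrast shrinks toward zero, meaning all points become almost equidistant.

```python
import numpy as np

# As dimensionality grows, the relative contrast between the nearest and
# farthest neighbour shrinks, so full-space distances lose discriminative power.
def relative_contrast(n_points, n_dims, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, n_dims))      # uniform points in the unit hypercube
    # Euclidean distances from the first point to all the others
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return (d.max() - d.min()) / d.min()    # near 0 means "almost equidistant"

for dims in (2, 20, 200, 2000):
    print(dims, round(relative_contrast(500, dims), 3))
```

With uniform data, the contrast is large in two dimensions and drops by orders of magnitude by a few thousand dimensions, which is exactly why the chapter argues for searching clusters in subspaces.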

Clustering of such data sets uses an approach called subspace clustering or projected clustering, aimed at finding clusters in different subspaces. Subspace clustering in general seeks to identify all the subspaces of the dimension space in which clusters are well separated (see, for instance, Goil, 1999; Woo, 2004). The terms subspace clustering and projected clustering are not always used consistently in the literature but, as a general rule, subspace clustering algorithms compute overlapping clusters, whereas projected clustering aims to partition the data set into disjoint clusters (see, for instance, Procopiuc, 2002; Achtert, 2008; Moise, 2008). Often, projected clustering algorithms search for clusters in subspaces, each of which is spanned by a number of base vectors (main axes). The performance of many subspace/projected clustering algorithms drops quickly with the size of the subspaces in which the clusters are found (Parsons, 2004). Also, many of them require domain knowledge provided by the user to help select and tune their settings, such as the maximum distance between dimensional values (Procopiuc, 2002), the thresholds of input parameters (Moise, 2008), and the minimum density (Agrawal, 2005), all of which are difficult to establish.

Recently, a number of soft projected clustering algorithms have been developed to identify clusters by assigning an optimal variable weight vector to each cluster (Domeniconi, 2007, Huang, 2005, Jing, 2007). Each of these algorithms iteratively minimizes an objective function. Although the cluster membership of an object is calculated in the whole variable space, the similarity between each pair of objects is based on weighted variable differences. The variable weights transform distance so that the associated cluster is reshaped into a dense hypersphere and can be separated from other clusters. Soft projected clustering algorithms are driven by evaluation criteria and search strategies. Consequently, defining the objective function and efficiently determining the optimal variable weights are the two most important issues in soft projected clustering.
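To illustrate the kind of objective such algorithms minimize, the following sketch evaluates a weighted within-cluster dispersion in the style of W-k-means (Huang, 2005): each cluster has its own variable weight vector, and squared differences are scaled by those weights so that irrelevant variables contribute little. The function name, the weight exponent `beta`, and the toy data are illustrative assumptions, not the chapter's exact formulation.

```python
import numpy as np

# Sketch of a W-k-means-style objective (cf. Huang, 2005); beta and all
# names are illustrative, not the chapter's exact formulation:
#   J = sum_k sum_{i in cluster k} sum_j w_{kj}^beta * (x_ij - z_kj)^2
def objective(X, labels, centroids, weights, beta=2.0):
    total = 0.0
    for k in range(centroids.shape[0]):
        members = X[labels == k]
        if members.size == 0:
            continue
        sq = (members - centroids[k]) ** 2          # per-variable squared error
        total += np.sum((weights[k] ** beta) * sq)  # scaled by cluster k's weights
    return total

# Toy example: two 3-D clusters in which only the first two variables are
# relevant; the third is noise. Each weight vector sums to 1 per cluster.
X = np.array([[0.0, 0.0, 5.0], [0.1, 0.0, 9.0],
              [4.0, 4.0, 5.0], [4.0, 4.1, 9.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.05, 0.0, 7.0], [4.0, 4.05, 7.0]])
weights = np.array([[0.45, 0.45, 0.10], [0.45, 0.45, 0.10]])
print(objective(X, labels, centroids, weights))
```

Down-weighting the noisy third variable yields a much smaller objective value than uniform weights would, which is precisely the reshaping into dense, well-separated clusters that the paragraph describes; finding such weight vectors efficiently is the search problem the chapter attacks with PSO.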
