Weighted Fuzzy-Possibilistic C-Means Over Large Data Sets

Renxia Wan (School of Electronics and Information Engineering, Tongji University, Shanghai, China & College of Information and Computation Science, Beifang University of Nationalities, Yinchuan, Ningxia, China), Yuelin Gao (College of Information and Computation Science, Beifang University of Nationalities, Yinchuan, Ningxia, China) and Caixia Li (Information Office, Donghua University, Shanghai, China)
Copyright: © 2012 |Pages: 26
DOI: 10.4018/jdwm.2012100104

Abstract

Several algorithms for clustering large data sets have been presented to date. Most of these clustering approaches are crisp ones, which are not well suited to the fuzzy case. In this paper, the authors explore a single pass approach to fuzzy-possibilistic clustering over large data sets. The basic idea of the proposed approach (weighted fuzzy-possibilistic c-means, WFPCM) is to use a modified possibilistic c-means (PCM) algorithm to cluster the weighted data points and centroids, with one data segment as a unit. Experimental results on both synthetic and real data sets show that WFPCM uses significantly less memory than the fuzzy c-means (FCM) algorithm and the possibilistic c-means (PCM) algorithm. Furthermore, the proposed algorithm is highly robust to noise, avoids splitting or merging the exact clusters into inaccurate ones, and preserves the integrity and purity of the natural classes.
Article Preview

Introduction

In recent years, the continual development of data storage techniques has made it possible to store massive amounts of data. This has raised a demand for new technology that can transform such data into knowledge. At the same time, the sheer volume of the data means that processing approaches cannot scan the data set repeatedly, since the scanning cost may be intolerable.

Clustering is one of the most important tasks in data analysis. It partitions objects into meaningful groups, called clusters, according to given criteria. Cluster analysis has become an important subject in several research fields, such as statistics, pattern recognition, machine learning, and data mining. Recently, various algorithms for clustering large data sets have been proposed. These algorithms are mainly based on sampling or on incremental loading structures. The sampling approaches (Aggarwal et al., 2009; Cheng et al., 1998; Guha et al., 1998; Kranen et al., 2011; Lee et al., 2009; Ng et al., 2002; Pal et al., 2002; Sakai et al., 2009; Yildizli et al., 2011) usually choose the samples by a certain rule, such as a chi-square or divergence hypothesis (Hathaway et al., 2006). The incremental approaches (Bradley et al., 1998; Farnstrom et al., 2000; Gupta et al., 2004; Karkkainen et al., 2007; Luhr et al., 2009; Nguyen-Hoang et al., 2009; Ning et al., 2009; O’Callaghan et al., 2002; Ramakrishnan et al., 1996; Siddiqui et al., 2009; Wan et al., 2010, 2011) generally maintain knowledge from previous runs of a clustering algorithm to produce or improve the future clustering model. Nevertheless, as Hore et al. (2007) pointed out, many existing algorithms for large and very large data sets address the crisp case and rarely the fuzzy case. This is because fuzzy clustering must repeat its clustering iterations until an optimal or acceptably near-optimal solution is obtained, which means scanning the data set repeatedly. This conflicts sharply with the requirements placed on algorithms that process large data sets. Kwok, Smith, Lozano, and Taniar (2002) clustered an insurance data set with a parallel fuzzy c-means (PFCM) clustering method, and Hore, Hall, and Goldgof (2007) presented a single pass fuzzy c-means algorithm (SP) for clustering large data sets. However, FCM is inherently sensitive to noise, and noise is usually unavoidable in large data sets, so PFCM and SP have considerable trouble in noisy environments.
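
To make the single pass idea concrete, the following is a minimal sketch of segment-wise weighted clustering. It uses the standard weighted fuzzy c-means update, not the modified PCM of the proposed WFPCM, whose details are not given in this preview. The function names (`memberships`, `weighted_fcm`, `single_pass_fcm`), the explicit `init_centers` argument, and the fixed iteration count are assumptions made for illustration. Each segment is loaded once, clustered together with the weighted centroids carried over from earlier segments, and then discarded.

```python
def memberships(points, centers, m, eps=1e-12):
    """Standard FCM memberships: u[i][k] is proportional to d2(x_i, v_k)^(-1/(m-1))."""
    u = []
    for x in points:
        # Squared Euclidean distances, floored at eps to avoid division by zero.
        d2 = [max(sum((a - b) ** 2 for a, b in zip(x, v)), eps) for v in centers]
        inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
        total = sum(inv)
        u.append([t / total for t in inv])
    return u


def weighted_fcm(points, weights, centers, m=2.0, iters=50):
    """Weighted FCM on one in-memory chunk; returns (centers, cluster_weights)."""
    dim = len(points[0])
    for _ in range(iters):  # fixed iteration count (assumption), no convergence test
        u = memberships(points, centers, m)
        new_centers = []
        for k in range(len(centers)):
            # Each point contributes with coefficient w_i * u_ik^m.
            coef = [w * (row[k] ** m) for w, row in zip(weights, u)]
            total = sum(coef)
            new_centers.append(
                [sum(cf * x[j] for cf, x in zip(coef, points)) / total
                 for j in range(dim)]
            )
        centers = new_centers
    # Condense the chunk: each centroid carries the weighted membership mass.
    u = memberships(points, centers, m)
    cluster_weights = [
        sum(w * row[k] for w, row in zip(weights, u)) for k in range(len(centers))
    ]
    return centers, cluster_weights


def single_pass_fcm(segments, init_centers, m=2.0, iters=50):
    """Process each data segment exactly once, keeping only weighted centroids."""
    centers = [list(v) for v in init_centers]
    cluster_weights = []
    for segment in segments:
        seg = [list(p) for p in segment]
        if not seg:
            continue
        # After the first segment, previous centroids join the new points
        # as weighted pseudo-points; raw points always have weight 1.
        points = centers + seg if cluster_weights else seg
        weights = cluster_weights + [1.0] * len(seg)
        centers, cluster_weights = weighted_fcm(points, weights, centers, m, iters)
    return centers, cluster_weights
```

Because `segments` can be any iterable, including a generator that reads from disk, only one segment and the current centroids need to be in memory at a time. Note that this sketch inherits the noise sensitivity of FCM discussed above, since every point's memberships are forced to sum to one.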
