Robust Clustering with Distance and Density

Hanning Yuan (School of Software, Beijing Institute of Technology, Beijing, China), Shuliang Wang (School of Software, Beijing Institute of Technology, Beijing, China), Jing Geng (Beijing Institute of Technology, Beijing, China), Yang Yu (Beijing Institute of Technology, Beijing, China) and Ming Zhong (Beijing Institute of Technology, Beijing, China)
Copyright: © 2017 | Pages: 12
DOI: 10.4018/IJDWM.2017040104

Abstract

Clustering is fundamental to making use of big data. However, AP (affinity propagation) performs poorly on non-convex datasets, and DBSCAN (density-based spatial clustering of applications with noise) is highly sensitive to its input parameter. Moreover, new characteristics such as volume, variety, velocity, and veracity make big data difficult to group. To address these issues, a parameter-free AP (PFAP) is proposed to group big data on the basis of both distance and density. First, it obtains a group of normalized densities from the AP clustering. The estimated parameters are monotonic. Then, these densities are used to run density clustering multiple times. Finally, the multiple density-clustering results undergo a two-stage amalgamation to produce the final clustering result. Experimental results on several benchmark datasets show that PFAP achieves better clustering quality than DBSCAN, AP, and APSCAN. It also performs better than APSCAN and FSDP.
Article Preview

AP is a distance-based algorithm that identifies exemplars in a dataset by imitating a message-passing and feedback routine between the data objects (Dueck & Frey, 2007; Frey & Dueck, 2007). It achieves lower error than traditional methods and is computationally efficient in many applications (Dueck & Frey, 2007; Dueck et al., 2008). The mutual similarities among objects are recorded in an input matrix S, whose entry s(i, k) is the similarity between objects i and k. The diagonal of the matrix, s(k, k), is treated as the reference for data object k to become a cluster center (Frey & Dueck, 2007, call this value the preference). The responsibility r(i, k), sent from data object i to the candidate cluster center k, indicates how suitable object k is as a cluster center for object i. The availability a(i, k), sent from the candidate cluster center k to data object i, reflects how likely object i is to choose k as its cluster center. The larger the values of r(i, k) and a(i, k), the higher the probability that object k becomes a cluster center and, consequently, the greater the chance that object i belongs to the cluster centered at object k. During this iterative process, AP keeps updating r(i, k) and a(i, k) between the data objects until the predefined convergence criterion is met. The AP parameters follow the original settings: the maximum number of iterations is 1000, the number of consecutive iterations for which the exemplars must remain unchanged is 100, and the damping coefficient is 0.9. The reference of each cluster center is set to the median value of the similarity matrix.
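The message-passing routine described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it uses the standard update rules of Frey and Dueck (2007), and it assumes that similarity is the negative squared Euclidean distance and that the shared reference is the median of the off-diagonal similarities. The names `affinity_propagation`, `stable_iter`, and `ref` are chosen here for illustration only.

```python
from statistics import median


def affinity_propagation(points, damping=0.9, max_iter=1000, stable_iter=100):
    """Cluster `points` (at least two coordinate tuples); returns (exemplars, labels)."""
    n = len(points)
    # Assumption: similarity is the negative squared Euclidean distance.
    s = [[-sum((x - y) ** 2 for x, y in zip(p, q)) for q in points] for p in points]
    # Assumption: the shared reference s(k, k) is the median off-diagonal similarity.
    ref = median(s[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        s[k][k] = ref

    r = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    a = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    exemplars, stable = [], 0

    for _ in range(max_iter):
        # r(i, k) <- s(i, k) - max over k' != k of [a(i, k') + s(i, k')]
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            runner_up = max(v for k, v in enumerate(scores) if k != best)
            for k in range(n):
                target = s[i][k] - (runner_up if k == best else scores[best])
                r[i][k] = damping * r[i][k] + (1 - damping) * target

        # a(i, k) <- min(0, r(k, k) + sum over i' not in {i, k} of max(0, r(i', k)))
        # a(k, k) <- sum over i' != k of max(0, r(i', k))
        for k in range(n):
            support = [max(0.0, r[i][k]) for i in range(n)]
            support[k] = r[k][k]
            total = sum(support)
            for i in range(n):
                target = total - support[i]
                if i != k:
                    target = min(0.0, target)
                a[i][k] = damping * a[i][k] + (1 - damping) * target

        # Object k is an exemplar when r(k, k) + a(k, k) is positive.
        current = [k for k in range(n) if r[k][k] + a[k][k] > 0]
        stable = stable + 1 if current and current == exemplars else 0
        exemplars = current
        if stable >= stable_iter:
            break

    if not exemplars:
        return [], [-1] * n
    # Each object joins its most similar exemplar; exemplars label themselves.
    labels = [max(exemplars, key=lambda k: s[i][k]) for i in range(n)]
    for k in exemplars:
        labels[k] = k
    return exemplars, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
exemplars, labels = affinity_propagation(points)
print(exemplars, labels)
```

The defaults mirror the settings quoted above (1000 iterations, 100 stable iterations, damping 0.9). The nested loops make every iteration quadratic in the number of objects, so the sketch is suitable only for small inputs.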
