Data Field for Hierarchical Clustering

Shuliang Wang (The University of Pittsburgh, USA and Wuhan University, China), Wenyan Gan (Nanjing University of Science and Technology, China), Deyi Li (Tsinghua University, China) and Deren Li (Wuhan University, China)
Copyright: © 2011 | Pages: 21
DOI: 10.4018/jdwm.2011100103

Abstract

In this paper, a data field is proposed for hierarchical clustering; it groups data objects by simulating their mutual interactions and opposite movements. Inspired by fields in physical space, a data field that simulates a nuclear field is presented to describe the interaction between objects in data space. In the data field, the self-organized process of equipotential lines over many data objects reveals their hierarchical clustering characteristics. During the clustering process, a random sample is first generated to optimize the impact factor. The masses of the data objects are then estimated in order to select the core data objects, which are those with nonzero mass. Taking the core data objects as the initial clusters, the clusters are iteratively merged level by level with good performance. The results of a case study show that the data field is capable of hierarchically clustering objects of varying size, shape, or granularity without user-specified parameters, while also considering the object features inside the clusters and removing outliers from noisy data. The comparisons illustrate that data field clustering performs better than K-means, BIRCH, CURE, and CHAMELEON.
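
As a rough illustration of the idea, the sketch below computes a potential value at a point in a data field. The abstract does not state the potential function, so the short-range Gaussian-like form used here, phi(x) = sum_i m_i * exp(-(||x - x_i|| / sigma)^2), is an assumption, as are the function name `potential`, the equal masses, and the value chosen for the impact factor `sigma`. It is not the authors' implementation.

```python
import math

def potential(x, points, masses, sigma):
    """Potential at position x in a data field (assumed Gaussian-like form).

    Each data object p with mass m contributes m * exp(-(d / sigma)**2),
    where d is the Euclidean distance from x to p and sigma is the
    impact factor that controls how far the interaction reaches.
    """
    total = 0.0
    for p, m in zip(points, masses):
        d = math.dist(x, p)  # Euclidean distance (Python 3.8+)
        total += m * math.exp(-(d / sigma) ** 2)
    return total

# Hypothetical toy data: three nearby objects and one distant object.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
masses = [0.25] * len(points)  # equal masses, assumed for illustration
sigma = 1.0                    # impact factor, chosen by hand here

# Potential evaluated at each data object.
phi = [potential(p, points, masses, sigma) for p in points]
```

Under these assumptions, the objects in the dense group receive a higher potential than the isolated one, which is the kind of contrast that lets equipotential lines separate clusters from outliers.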

Introduction

Rapid advances in massive data acquisition, transmission, and storage have led to the growth of vast computerized datasets at unprecedented rates. These datasets come from various sectors, e.g., business, education, government, the scientific community, the Internet, or any of many readily available off-line and online data sources, in the form of text, graphics, images, video, audio, animation, hyperlinks, markup, and so on (Li, Zhang, & Wang, 2006; Bhatnagar, Kaur, & Mignet, 2009). Moreover, they keep growing in both attribute depth and number of instances. Although many decisions are made on large datasets, the sheer volume of computerized data has far exceeded the human ability to interpret it completely (Li et al., 2006). In order to understand and make full use of these data repositories when making decisions, it is necessary to develop techniques for uncovering the physical nature inside such huge datasets.

Clustering is one of the techniques for discovering a segmentation rule from these data repositories. It assigns a set of objects to clusters (subsets) on the basis of their observations, so that objects within the same cluster are similar to one another and dissimilar to the objects in other clusters (Murtagh, 1983; Grabmeier & Rudolph, 2002; Xu & Wunsch, 2005; Li, Wang, & Li, 2006; Malik et al., 2010). It is an unsupervised technique that requires no prior knowledge of what causes the grouping or how many groups exist (Song, Hu, & Yoo, 2009; Engle & Gangopadhyay, 2010; Silla & Freitas, 2011). Clustering of arbitrarily shaped data has also been treated (Wan, Wang, & Su, 2010). Clustering may be based on hierarchy, partition, density, grid, constraint, subspace, and so on (Sander et al., 1998; Kwok et al., 2002; Grabmeier & Rudolph, 2002; Parsons, Haque, & Liu, 2004; Zhang et al., 2008; Horng et al., 2011).
