Subspace Clustering for High-Dimensional Data Using Cluster Structure Similarity

Kavan Fatehi (Yazd University, Department of Computer Engineering, Yazd, Islamic Republic of Iran), Mohsen Rezvani (Shahrood University of Technology, Department of Computer Engineering, Shahrood, Islamic Republic of Iran), Mansoor Fateh (Shahrood University of Technology, Department of Computer Engineering, Shahrood, Islamic Republic of Iran) and Mohammad-Reza Pajoohan (Yazd University, Department of Computer Engineering, Yazd, Islamic Republic of Iran)
Copyright: © 2018 | Pages: 18
DOI: 10.4018/IJIIT.2018070103


Recently, because of the curse of dimensionality in high-dimensional data, a significant amount of research has been conducted on subspace clustering, which aims to discover clusters embedded in any possible combination of attributes. The main goal of subspace clustering algorithms is to find all clusters in all subspaces. Previous studies have mostly generated redundant subspace clusters, reducing clustering accuracy and increasing the running time of the algorithms. This article suggests a bottom-up density-based approach in which the cluster structure serves as a similarity measure for generating optimal subspaces, thereby raising the accuracy of subspace clustering. Based on this idea, the algorithm discovers similar subspaces by considering the similarity of their cluster structures, combines them, and clusters the data again in the new subspaces. Finally, the algorithm determines all the subspaces and finds all clusters within them. Experiments on various synthetic and real datasets show that the proposed approach is significantly better in quality and runtime than the state-of-the-art in clustering high-dimensional data.
Article Preview

1. Introduction

Clustering is the process of dividing data based on their similarity (Bouveyron & Brunet, 2014). Clustering is used in a variety of fields such as similarity search, web mining, segmentation, compression, pattern recognition, bioinformatics, and classification (Chen, Ye, Xu & Huang, 2012). Traditional clustering algorithms consider the full feature space of the data; as datasets become larger and more complex, the existing algorithms must be adapted to maintain their quality and speed (McWilliams & Montana, 2014). For example, high-dimensional data clustering is necessary for modern steganalysis using deep learning, where feature design becomes more and more difficult, in order to capture more accurate detection of complex image sources (Qian, Dong, Wang & Tan, 2015). Another application of high-dimensional data clustering is online face recognition: due to high-dimensional texture representations and unconstrained optimization, real-time systems suffer from low efficiency, so high-dimensional clustering can improve the runtime of such systems (Wang, Gao, Tao, Yang & Li, 2017).

The curse of dimensionality is one of the main challenges of clustering high-dimensional data sets (Gan & Ng, 2015). As dimensionality grows, the curse of dimensionality condemns all distances to look alike, rendering nearest-neighbor distances rather meaningless in high-dimensional data (Esmin, Coelho & Matwin, 2015). The dimensionality challenge for the clustering process can be viewed from two perspectives. First, some of the attributes considered when determining whether data belong to a cluster are actually irrelevant; over the full feature space, such attributes make the distance calculation for the clustering process more complex. For example, some clusters lie in subsets of the attributes, making it nearly impossible to properly identify them in the full feature space. Second, the subset of attributes that defines one cluster may differ from the subset that defines another, so a global feature-reduction procedure may be incapable of identifying a single subspace that contains all the clusters. Accordingly, it may be necessary to characterize clusters in an overlapping way, i.e., a data point d belongs to cluster C1 based on one subset of attributes while being placed into another cluster C2 under a different subset of attributes (Sim, Gopalkrishnan, Zimek & Cong, 2013).
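The distance-concentration effect mentioned above can be observed directly. The following sketch (an illustrative demonstration, not code from the article) measures the relative contrast of distances from random points to the origin, which collapses as the number of dimensions grows:

```python
import numpy as np

def distance_contrast(n_points=500, dim=2, seed=0):
    """Relative contrast (d_max - d_min) / d_min of the distances from
    uniformly random points to the origin. As dim grows, the contrast
    shrinks, so 'nearest' and 'farthest' neighbors become nearly alike."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n_points, dim))        # points in the unit hypercube
    dists = np.linalg.norm(pts, axis=1)      # Euclidean distances to origin
    return (dists.max() - dists.min()) / dists.min()

# In 2 dimensions the contrast is large; in 1000 dimensions it is tiny,
# which is why distance-based clustering degrades in the full feature space.
low_dim_contrast = distance_contrast(dim=2)
high_dim_contrast = distance_contrast(dim=1000)
```

The point counts and seed are arbitrary; any choice shows the same trend.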

Subspace clustering was introduced to address these two challenges of traditional clustering. The aim of subspace clustering is to find all clusters in all subspaces (Sim et al., 2013). Most previous subspace clustering algorithms do not scale to high-dimensional data: an increase in data dimensionality leads to an exponential increase in the running time of the clustering algorithm (Kriegel, Kröger & Zimek, 2012). Previous subspace clustering works either assign each data point to exactly one cluster or find clusters that overlap (Assent, 2012). Algorithms that assign each data point to a single cluster are known as projected clustering algorithms (Tomašev, Radovanović, Mladenić & Ivanović, 2015); in these algorithms, some information about data that could belong to several clusters is lost (Nie, Wang & Huang, 2014). On the other hand, algorithms that discover clusters with overlapping data preserve more information, but because they find different subspaces that share data, interpreting their results is more difficult (Zimek, 2013). Moreover, most previous subspace clustering algorithms are incapable of discovering clusters of different shapes, sizes, and densities, and some depend on input parameters.
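The bottom-up idea described in the abstract, using cluster structure as the similarity measure between subspaces, can be sketched roughly as follows. This is a minimal illustration under simplifying assumptions, not the authors' exact algorithm: `cluster_1d`, the co-membership Jaccard similarity, and the thresholds are all placeholders for the article's actual density clustering and similarity measure.

```python
import numpy as np
from itertools import combinations

def cluster_1d(values, eps=0.5):
    """Toy density clustering of one attribute: sort the values and
    start a new cluster wherever the gap to the next value exceeds eps."""
    order = np.argsort(values)
    labels = np.empty(len(values), dtype=int)
    labels[order[0]] = cid = 0
    for prev, cur in zip(order[:-1], order[1:]):
        if values[cur] - values[prev] > eps:
            cid += 1
        labels[cur] = cid
    return labels

def structure_similarity(a, b):
    """Jaccard similarity of the co-membership relations induced by two
    labelings: how often the same point pairs share a cluster in both."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    both = np.triu(same_a & same_b, 1).sum()
    either = np.triu(same_a | same_b, 1).sum()
    return both / either if either else 1.0

def merge_similar_subspaces(X, eps=0.5, threshold=0.8):
    """Bottom-up pass: cluster each attribute separately, then pair up
    attributes whose cluster structures are similar enough to be merged
    into candidate 2-D subspaces (which would then be re-clustered)."""
    labels = [cluster_1d(X[:, j], eps) for j in range(X.shape[1])]
    return [(i, j) for i, j in combinations(range(X.shape[1]), 2)
            if structure_similarity(labels[i], labels[j]) >= threshold]
```

For instance, if two attributes both separate the data into the same two groups while a third attribute carries no cluster structure, only the first two are merged; the merged subspace would then be re-clustered, and the pass repeated on higher-dimensional candidates.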
