1. Introduction
Clustering is the process of dividing data into groups based on their similarity (Bouveyron & Brunet, 2014). It is used in a variety of fields such as similarity search, web mining, segmentation, compression, pattern recognition, bioinformatics and classification (Chen, Ye, Xu & Huang, 2012). Traditional clustering algorithms consider the full feature space of the data, so as datasets become larger and more complex, the existing algorithms must be adapted to preserve their quality and speed (McWilliams & Montana, 2014). For example, modern steganalysis based on deep learning requires high-dimensional data clustering to detect complex image sources more accurately, since manual feature design has become increasingly difficult (Qian, Dong, Wang & Tan, 2015). Another application of high-dimensional data clustering is online face recognition: because of high-dimensional texture representations and unconstrained optimization, real-time systems suffer from low efficiency, and high-dimensional clustering can improve their runtime (Wang, Gao, Tao, Yang & Li, 2017).
The curse of dimensionality is one of the main challenges of clustering high-dimensional data sets (Gan & Ng, 2015). As the number of dimensions grows, all pairwise distances tend to look alike, rendering nearest-neighbor relations rather meaningless in high-dimensional data (Esmin, Coelho & Matwin, 2015). The dimensionality challenge of the clustering process can be viewed from two perspectives. First, some of the attributes used to determine which data belong to a cluster are in fact irrelevant to it; when the full feature space is considered, such attributes make the distance computation used for clustering less discriminative. For example, some clusters exist only in subsets of the attributes, which makes it nearly impossible to identify them properly in the full feature space. Second, the subset of attributes that defines one cluster may differ from the subset that defines another, so a global feature-reduction procedure may be incapable of identifying a single subspace that contains all the clusters. Accordingly, it can be meaningful to characterize clusters in an overlapping way, i.e. a data point d belongs to cluster C1 based on one subset of attributes while it can be placed into another cluster C2 by considering a different subset of attributes (Sim, Gopalkrishnan, Zimek & Cong, 2013).
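The distance-concentration effect described above can be observed directly. The following sketch (illustrative only, not from the cited works) measures the relative contrast (d_max - d_min) / d_min of Euclidean distances from a query point to uniform random points; as the dimensionality grows, this contrast shrinks toward zero, so the "nearest" neighbor is barely nearer than the farthest:

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from a
    random query point to n_points uniform random points in [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# Contrast decreases as dimensionality increases: distances concentrate.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The exact values depend on the random seed, but the downward trend with dimension is robust, which is what makes full-space distance comparisons unreliable for clustering high-dimensional data.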
Subspace clustering was introduced to address these two challenges of traditional clustering. Its aim is to find all clusters in all subspaces (Sim et al., 2013). Most previous subspace clustering algorithms are not scalable to high-dimensional data: as the dimensionality of the data increases, their running time grows exponentially (Kriegel, Kröger & Zimek, 2012). Previous subspace clustering works either assign each data point to exactly one cluster or find clusters that overlap (Assent, 2012). Algorithms that assign each data point to a single cluster are known as projected clustering algorithms (Tomašev, Radovanović, Mladenić & Ivanović, 2015); in these algorithms, some of the information about data that could belong to several clusters is lost (Nie, Wang & Huang, 2014). On the other hand, algorithms that discover clusters with overlapping data produce more information, although reporting the different subspaces in which the clusters reside makes their results more difficult to interpret (Zimek, 2013). Moreover, most previous subspace clustering algorithms are incapable of discovering clusters with different shapes, sizes and densities, and some of them depend on input parameters.
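The overlapping-membership idea can be made concrete with a toy sketch (all data values, cluster names and attribute subsets below are invented for illustration): the same point is nearest to cluster C1 when only attributes {0, 1} are considered, but nearest to cluster C2 under attributes {2, 3}, so an overlapping subspace clustering would report it in both.

```python
def assign(point, centers, subspace):
    """Assign point to the nearest center, measuring squared Euclidean
    distance only over the given subset of attribute indices."""
    def dist(name):
        return sum((point[i] - centers[name][i]) ** 2 for i in subspace)
    return min(centers, key=dist)

# Hypothetical cluster centers in a 4-attribute space.
centers = {
    "C1": (0.0, 0.0, 9.0, 9.0),
    "C2": (9.0, 9.0, 0.0, 0.0),
}

# Point d is close to C1 in attributes {0, 1} and to C2 in {2, 3}.
d = (0.5, 0.2, 0.4, 0.1)

print(assign(d, centers, subspace=(0, 1)))  # prints C1
print(assign(d, centers, subspace=(2, 3)))  # prints C2
```

A projected clustering algorithm would be forced to keep only one of these two memberships, which is exactly the information loss the text refers to.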