Clustering Hybrid Data Using a Neighborhood Rough Set Based Algorithm and Expounding its Application

Akarsh Goyal (Department of Computer Science, Viterbi School of Engineering, University of Southern California, Los Angeles, USA) and Rahul Chowdhury (VIT University, Vellore, India)
Copyright: © 2019 | Pages: 17
DOI: 10.4018/IJFSA.2019100105
This article was retracted

Abstract

In recent times, innumerable clustering algorithms have been developed whose main function is to group objects that have almost the same features. However, the presence of categorical data values makes many of these algorithms difficult to apply. Moreover, some algorithms that can handle categorical data cannot process uncertainty in the values and therefore have stability issues. Handling categorical data together with uncertainty has thus become necessary. To this end, the MMR algorithm, based on basic rough set theory, was developed in 2007. MMeR, proposed in 2009, surpassed the results of MMR on categorical data but cannot be used robustly for hybrid data. In this article, the authors generalize the MMeR algorithm with neighborhood relations, turning it into a neighborhood rough set model which this article calls MMeNR (Min Mean Neighborhood Roughness) and which handles heterogeneous data. The authors have also extended the MMeNR method to make it suitable for applications such as geospatial data analysis and epidemiology.

Introduction

We are living in a world full of data generated from several sources. Data describe the characteristics of living species, depict the properties of natural phenomena, summarize the results of scientific experiments, and record the dynamics of running machinery. More importantly, data provide a basis for further analysis, reasoning, decisions and, ultimately, for the understanding of all kinds of objects and phenomena. Clustering is one of the most important data analysis activities; it classifies or groups data with similar properties into a set of categories or clusters. It has been observed (Anderberg, 1973; Everitt et al., 2001) that classification is one of the most primitive activities of human beings and has played an important and indispensable role throughout their long history. In order to learn a new object or understand a new phenomenon, people always try to identify descriptive features and then compare these features with those of known objects or phenomena, based on their similarity or dissimilarity, generalized as proximity, according to some standards or rules. Indeed, naming and classifying are essentially synonymous, according to Everitt et al. (2001). With such classification information at hand, we can infer the properties of a specific object from the category to which it belongs.

Clustering (Huang, 1998) segments large hybrid data sets into small subsets that can be easily managed and analysed, and it uncovers the groupings that come naturally to the objects. Many areas make use of clustering techniques. For instance, Wu et al. used clustering in a method for handling the complexity of gene data. Jiang et al. developed clustering techniques for the analysis of gene expression data (Jiang, Tang, & Zhang, 2004). Wong et al. gave a method for positron emission tomography (PET), a nuclear medical imaging technique, in which clustering was used to segment the tissues (Wong, Feng, Meikle, & Fulham, 2002). In 1989, cluster analysis was used to segment radar signals while scanning land and marine objects (Haimov et al., 1989). Cluster analysis was also applied to large-scale research and development planning (Mathieu & Gibson, 1993). These techniques mostly handle only numerical data sets and hence cannot be used for data sets whose domains are categorical (Gibson, Kleinberg, & Raghavan, 2000; Guha, Rastogi, & Shim, 2000). Earlier work in the field of clustering developed algorithms that could only take care of numerical data (Dempster, Laird, & Rubin, 1977), as it was very easy to formulate similarity functions between numerical values. Categorical data are more difficult because their features are multi-valued. Correspondence then appears both as values that are the same within a given attribute and as objects that are similar to one another, so similarity has to be examined along the columns as well as the rows.
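To make the contrast between numerical and categorical similarity concrete, the following minimal Python sketch combines the two kinds of attribute in one distance and uses it to form a neighborhood around a record. It is an illustration under stated assumptions, not the MMeNR algorithm of this article: the function names `hybrid_distance` and `neighborhood`, the threshold `delta`, the sample records, and the choice of squared differences for numerical attributes plus a 0/1 mismatch count for categorical ones are all hypothetical choices made for this example. Numerical values are assumed to be scaled to [0, 1].

```python
from math import sqrt

def hybrid_distance(x, y, numeric_idx, categorical_idx):
    """Distance between two records with mixed attribute types (illustrative only)."""
    # Numerical part: sum of squared differences (values assumed scaled to [0, 1]).
    num = sum((x[i] - y[i]) ** 2 for i in numeric_idx)
    # Categorical part: simple matching, contributing 1 for each mismatched value.
    cat = sum(1 for i in categorical_idx if x[i] != y[i])
    return sqrt(num + cat)

def neighborhood(data, k, numeric_idx, categorical_idx, delta):
    """Indices of all records whose distance to record k is at most delta."""
    return [
        j for j, row in enumerate(data)
        if hybrid_distance(data[k], row, numeric_idx, categorical_idx) <= delta
    ]

# Hypothetical records: two numerical attributes and one categorical attribute.
records = [
    (0.10, 0.20, "red"),
    (0.15, 0.25, "red"),
    (0.12, 0.22, "blue"),
    (0.90, 0.80, "red"),
]

print(neighborhood(records, 0, [0, 1], [2], delta=0.2))  # prints [0, 1]
```

In this sketch a single categorical mismatch already pushes the distance above 1, so the third record falls outside the neighborhood of the first even though its numerical values are close. A real method would need to weigh the two attribute types more carefully.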

Some of the earliest clustering methods for categorical data sets are due to Dempster et al. (1977), Guha et al. (2000) and Gibson et al. (2000). However, these methods are not capable of handling uncertainty in data. As a result, they have stability issues, which render them ineffective for real-world databases in which uncertainty is inherent.
