A Comparative Analysis of Rough Set Based Intelligent Techniques for Unsupervised Gene Selection

P. K. Nizar Banu (Department of Computer Applications, B. S. Abdur Rahman University, Chennai, Tamil Nadu, India) and H. Hannah Inbarani (Department of Computer Science, Periyar University, Chennai, Tamil Nadu, India)
Copyright: © 2013 |Pages: 14
DOI: 10.4018/ijsda.2013100103

Abstract

As microarray databases grow in dimension and complexity, identifying the most informative genes becomes a challenging task. The difficulty stems largely from the huge number of genes measured over very few samples. Research in medical data mining addresses this problem by applying techniques from data mining and machine learning to microarray datasets. In this paper, Unsupervised Tolerance Rough Set based Quick Reduct (U-TRS-QR), a feature selection algorithm that extends existing equivalence-based rough sets to unsupervised learning, is proposed. Genes selected by the proposed method lead to considerably improved class predictions in extensive experiments on two gene expression datasets: Brain Tumor and Colon Cancer. The results indicate consistent improvement across 12 classifiers.

Introduction

Dimensionality reduction has received considerable attention in microarray data analysis as a way to retain the most informative genes by removing the least informative ones. Feature extraction and feature selection are the two main methods used for dimensionality reduction. Feature extraction transforms all or part of the features into a lower-dimensional space, whereas feature selection selects a subset of the original features. In the analysis of gene expression datasets, feature selection bears a significant advantage over feature extraction methods (Assaf, 2009). Feature selection refers to the process of selecting highly informative features that are most effective in characterizing a given field. It addresses the specific task of finding a subset of the given features that is useful for solving the domain problem without disrupting the underlying meaning of the selected features. Many criteria can be employed to measure the similarity among features (Mitra et al., 2002). Feature selection methods can be divided into supervised and unsupervised methods. Genetic algorithms have been used successfully as an efficient method of supervised feature selection for high-dimensional spectral datasets (Cho et al., 2008; Davis et al., 2006). Supervised feature selection has also been formulated as a multiple hypothesis testing procedure that controls the false discovery rate (Mei et al., 2009; Kim et al., 2008). Compared with the extensive work on feature extraction and supervised feature selection, relatively few attempts have been made to identify important features using unsupervised feature selection methods (Mao, 2005). Unsupervised feature selection methods are usually divided into three categories: wrapper, filter, and hybrid approaches (Kim & Gao, 2006). Dy and Brodley (2000) introduced a wrapper approach that uses the Expectation Maximization (EM) clustering algorithm.
Hastie et al. (2000) developed a gene-shaving method that uses the first principal component to identify the best subsets of features with large variation. Ding (2003) proposed a two-way ordering approach in which relevant genes were selected based on their similarity information. PCA is a widely used unsupervised feature extraction method in that the process depends solely on the input variables and does not take into account information from the output variable (Jolliffe, 2002).
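To make the extraction-versus-selection distinction concrete, the following is a minimal sketch in numpy: PCA produces new features that mix all genes, while selection returns a subset of the original columns with values preserved. The toy data, the choice of two components, and the selected column indices are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # toy matrix: 6 samples, 4 "genes"
Xc = X - X.mean(axis=0)              # center before PCA

# feature extraction: PCA projects samples onto combinations of ALL genes
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
extracted = Xc @ Vt[:2].T            # 2 new features, each a mixture of all 4 genes

# feature selection: keep a subset of the ORIGINAL genes unchanged
selected = X[:, [0, 2]]              # hypothetical choice of genes 0 and 2

print(extracted.shape, selected.shape)   # (6, 2) (6, 2)
```

Each extracted feature is a linear combination of every input gene, so its biological meaning is blurred; each selected column remains an interpretable gene, which is the advantage the paper's feature-selection approach relies on.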

This paper studies and analyzes gene expression datasets that contain a large number of features (genes); the majority of these genes are not relevant to describing the problem and in turn degrade classification performance. Recent approaches to feature selection include probabilistic neural networks (Huang, 2004), Support Vector Machines (Cao et al., 2003), neuro-fuzzy computing (Chu et al., 2004), neuro-genetic hybridization (Karzynski et al., 2003), a sparse unsupervised dimensionality reduction method (Dou et al., 2010), Unsupervised Quick Reduct (Velayutham & Thangavel, 2011) and Unsupervised Relative Reduct (Velayutham & Thangavel, 2011). The idea behind feature selection is to retain the genes that play a major role in arriving at a decision about the output classes.
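The Unsupervised Quick Reduct idea mentioned above can be sketched as a greedy forward search: grow an attribute subset R as long as the mean dependency of every attribute on R improves. The following is an illustrative reimplementation on discrete data only (real-valued handling is what the paper's tolerance extension adds); the function names, tie-breaking, and toy dataset are assumptions, not the authors' code.

```python
def partition(data, attrs):
    """Equivalence classes of the indiscernibility relation on `attrs`."""
    blocks = {}
    for i, row in enumerate(data):
        key = tuple(row[a] for a in attrs)
        blocks.setdefault(key, set()).add(i)
    return list(blocks.values())

def dependency(data, B, a):
    """gamma_B({a}): fraction of objects whose B-block lies inside one a-block."""
    if not B:
        return 0.0
    target = partition(data, [a])
    pos = sum(len(blk) for blk in partition(data, B)
              if any(blk <= t for t in target))
    return pos / len(data)

def unsupervised_quick_reduct(data):
    """Greedy sketch: grow R to maximize the mean dependency of every
    attribute on R; stop when dependency reaches 1.0 or stops improving."""
    n_attrs = len(data[0])
    all_attrs = list(range(n_attrs))
    mean_dep = lambda R: sum(dependency(data, R, a) for a in all_attrs) / n_attrs
    R, best = [], 0.0
    while best < 1.0:
        gains = [(mean_dep(R + [a]), a) for a in all_attrs if a not in R]
        if not gains:
            break
        new_best, a = max(gains)
        if new_best <= best:
            break
        R.append(a)
        best = new_best
    return R

# toy data where either attribute alone determines the other
print(unsupervised_quick_reduct([[0, 0], [0, 0], [1, 1], [1, 1]]))
```

Because dependency is computed between attributes rather than against a class label, no decision attribute is needed, which is what makes the search unsupervised.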

Gene expression datasets consist of real values that express the functional value of every gene. Most existing supervised and unsupervised feature selection algorithms discretize or normalize the original values, which results in some information loss. Traditional rough sets are likewise incapable of dealing with real-valued datasets. This paper addresses the problem by introducing an extension of rough sets called Tolerance Rough Sets (TRS), with which the gene expression values are preserved as they are. In this paper, the terms features and genes are used interchangeably.
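The key device TRS adds is a tolerance (similarity) relation: two objects are similar on an attribute when their real values differ by at most a threshold, so no discretization is needed. The sketch below computes tolerance classes under this standard relation; the threshold value, the all-attributes conjunction, and the toy expression matrix are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def tolerance_classes(data, attrs, tau=0.1):
    """Tolerance class of each object: all objects whose values on every
    attribute in `attrs` differ by at most `tau` (data assumed scaled to
    [0, 1]; `tau` is an assumed global threshold)."""
    n = data.shape[0]
    classes = []
    for i in range(n):
        similar = {j for j in range(n)
                   if all(abs(data[i, a] - data[j, a]) <= tau for a in attrs)}
        classes.append(similar)
    return classes

# toy expression matrix: three samples, two real-valued genes
data = np.array([[0.10, 0.20],
                 [0.15, 0.25],
                 [0.90, 0.80]])
print(tolerance_classes(data, attrs=[0, 1], tau=0.1))
# → [{0, 1}, {0, 1}, {2}]
```

Unlike crisp equivalence classes, tolerance classes may overlap and the relation is not transitive, which is exactly what lets the real-valued expression levels be used directly.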

The rest of the paper is organized as follows. The first section covers the research background, followed by a brief introduction to tolerance rough sets. The proposed algorithm for gene selection is then described with a worked example. Experimental results are subsequently presented, discussed and analyzed. Finally, concluding remarks and future work are presented.
