A Cosine-Similarity Mutual-Information Approach for Feature Selection on High Dimensional Datasets

Vimal Kumar Dubey (Guru Ghasidas Vishwavidyalaya, Bilaspur, India) and Amit Kumar Saxena (Guru Ghasidas Vishwavidyalaya, Bilaspur, India)
Copyright: © 2017 | Pages: 14
DOI: 10.4018/JITR.2017010102


A novel hybrid method based on cosine similarity and mutual information is presented for finding a relevant feature subset. First, the supervised cosine similarity of each feature is calculated with respect to the class vector, and the features are grouped according to the resulting similarity values. From each group, the most mutually informative feature is selected. The selected feature subset is then evaluated for classification accuracy using three classifiers: Naïve Bayes (NB), K-Nearest Neighbor, and Classification and Regression Trees (CART). The proposed method is applied to various high-dimensional datasets. The results show that the proposed method is capable of eliminating redundant and irrelevant features.
Article Preview


High-dimensional datasets such as colon, prostate, and others have far fewer patterns than features. Classification (or prediction) on such datasets is difficult because of the large number of features, so in recent decades researchers have focused increasingly on feature selection techniques. Classification is an indispensable part of data mining (Han & Kamber, 2006), machine learning (Mitchell, 1997), and pattern recognition (Duda et al., 2001), and it is defined as the labeling of an unseen pattern based on some information or rules. Classifiers such as Support Vector Machines (Cortes & Vapnik, 1995), Naive Bayes (John & Langley, 1995), Artificial Neural Networks (Haykin, 1999), and others train themselves on training data; when an unseen pattern is then given to them, they label it.

Because classifier performance depends on the training data, eliminating irrelevant, redundant, and noisy features is necessary. Classification accuracy can be increased if a non-redundant, relevant, and noise-free dataset is used for learning. Conversely, irrelevant, redundant, and noisy features in the dataset decrease classifier performance (commonly measured as accuracy), an effect often termed the Curse of Dimensionality (Jain & Zongker, 1997). Removing irrelevant, redundant, and noisy features is termed Dimensionality Reduction (Liu & Motoda, 1998; Guyon et al., 2006; Saxena et al., 2009). Feature selection (Liu & Motoda, 1998; Saxena et al., 2009) and feature extraction are two well-known methods applied to the dimensionality reduction problem. Processing the existing features of the dataset to obtain new features is termed feature extraction, while selecting a subset of the existing features without transforming them is known as feature selection. In this paper, a novel method is proposed to achieve feature selection in databases.
Feature selection is applied extensively in medical applications (Artioli et al., 1995), data mining (Han & Kamber, 2006), classification, and other areas (Maimon & Rokach, 2010).

In the proposed method, cosine similarity and mutual information are hybridized to remove redundant and irrelevant features. Supervised cosine similarity is used as a measure to group the features, and then mutual information (information gain) is used to select the most relevant feature from each group. Because only one feature is kept per group, the redundant features within a group are removed.
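The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the preview does not say how the cosine similarity values are turned into groups or how continuous features are discretized before mutual information is computed, so the sketch assumes equal-width binning for both. The function names (`cosine_similarity`, `discretize`, `mutual_information`, `select_features`) and the parameters `n_groups` and `n_bins` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)


def discretize(values, n_bins):
    """Equal-width binning (an assumption; the source does not specify a scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((x - lo) / width), n_bins - 1) for x in values]


def mutual_information(xs, ys):
    """Mutual information, in bits, between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def select_features(columns, labels, n_groups=3, n_bins=2):
    """Group features by cosine similarity to the class vector,
    then keep the feature with the highest mutual information per group."""
    sims = [cosine_similarity(col, labels) for col in columns]
    groups = defaultdict(list)
    # Assumed grouping rule: equal-width bins over the similarity values.
    for idx, group_id in enumerate(discretize(sims, n_groups)):
        groups[group_id].append(idx)
    selected = [
        max(members,
            key=lambda i: mutual_information(discretize(columns[i], n_bins), labels))
        for members in groups.values()
    ]
    return sorted(selected)


# Toy data: each inner list is one feature (column); labels is the class vector.
labels = [0, 0, 0, 1, 1, 1]
columns = [
    [0.1, 0.2, 0.1, 0.9, 0.8, 0.9],  # follows the class closely
    [0.2, 0.8, 0.2, 0.8, 0.9, 0.8],  # similar to the first, but noisier
    [0.9, 0.8, 0.9, 0.1, 0.2, 0.1],  # inverse of the class
    [0.5, 0.1, 0.9, 0.5, 0.1, 0.9],  # unrelated to the class
]
print(select_features(columns, labels))
```

In this toy run the first two features fall into the same similarity group, so only the more informative of the pair survives. Note that under the assumed grouping rule every non-empty group contributes one feature, so the number of groups bounds the size of the selected subset.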

This paper is organized into the following sections. Section 2 introduces related work on feature selection algorithms. Section 3 gives preliminaries on terms used later in the paper, namely similarity measures and mutual information (information gain). Section 4 explains the proposed method through algorithms and a model. Section 5 lists and briefly describes the datasets and the experiments. Sections 6 and 7 contain the results obtained with the proposed method on the listed datasets, together with a comparison with other methods. Section 8 concludes the paper and outlines future scope.
