Feature Selection Algorithm Using Relative Odds for Data Mining Classification

Donald Douglas Atsa'am (Department of Mathematics, Statistics and Computer Science, University of Agriculture, Makurdi, Nigeria)
Copyright: © 2020 | Pages: 26
DOI: 10.4018/978-1-5225-9750-6.ch005

Abstract

A filter feature selection algorithm is developed and its performance tested. In the initial step, the algorithm dichotomizes the dataset, then separately computes the association between each predictor and the class variable using relative odds (odds ratios). The value of each odds ratio becomes the importance ranking of the corresponding explanatory variable in determining the output. Logistic regression classification is deployed to test the performance of the new algorithm in comparison with three existing feature selection algorithms: the Fisher index, Pearson's correlation, and the varImp function. A number of experimental datasets are employed, and in most cases, the subsets selected by the new algorithm produced models with higher classification accuracy than the subsets suggested by the existing feature selection algorithms. Therefore, the proposed algorithm is a reliable alternative in filter feature selection for binary classification problems.

Introduction

This chapter develops a filter algorithm that uses relative odds, also referred to as odds ratios (OR) (Vanderweele & Vansteelandt, 2010), to select relevant features for binary classification models. Big data consists of voluminous datasets emanating from commercial, social media, educational, health care, government, and industrial activities, among others. In most instances, these datasets comprise high-dimensional feature spaces in which some features are redundant, irrelevant, or duplicative, and therefore unnecessary for machine learning. When machine learning models are constructed on the entire feature space without first expunging the irrelevant features, the resultant models fit poorly and have low predictive accuracy. In addition, such models are complex, waste system resources, and are difficult to interpret. For this reason, it is necessary to select the most important and relevant features for inclusion in classification models.

The OR is a statistical measure that evaluates the strength of association between two binary variables, where one variable is dependent on the other (Hancock & Kent, 2016). In this research, the feature selection algorithm to be developed evaluates the OR between each predictor variable and the class variable. The resulting OR represents how important the corresponding predictor is in determining the outcome. When comparing the importance of two predictors, the one with the higher OR is considered the more important. By comparing the computed values for all predictors in the dataset, the modeler can decide which variables to include in, or exclude from, predictive models. Because the OR is usually computed on binary variables, the first step of the proposed algorithm is to convert the dataset to binary values. For each explanatory variable, all data points below 0.5 are rounded to 0, and all data points of 0.5 and above are rounded to 1. The algorithm therefore requires a dataset that has been normalized to the scale [0, 1]. Secondly, the algorithm separately counts the input/output combinations (0,0), (1,1), (0,1), and (1,0) for each predictor across all observations of the dataset. Thirdly, the OR formula is applied to the counts obtained in step 2. Fourthly, the algorithm outputs the OR for each predictor variable, together with the name of the corresponding predictor, in the order in which the predictors appear in the dataset. The OR value translates to the importance rank of the corresponding predictor. As a filter method, the algorithm works independently of any machine learning algorithm. The task is to generate a variable importance ranking that guides users in choosing the relevant subset from the variable domain that will produce better classification results. The ranking is anchored on the strength of association between a predictor variable and the class variable. Any modeling tool at the disposal of the user can be deployed for machine learning using the best subsets suggested by the proposed algorithm.
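
The four steps described above can be sketched in Python as follows. This is only an illustrative sketch, not the chapter's implementation: the function names (dichotomize, odds_ratio, rank_by_odds_ratio), the dictionary-of-columns input, and the toy data are hypothetical. The OR is computed with the standard 2x2 formula, and the handling of empty cells (adding 0.5 to every cell) is an assumption introduced here, since this preview does not say how zero counts are treated.

```python
from typing import Dict, List, Sequence, Tuple


def dichotomize(value: float) -> int:
    """Step 1: round a value normalized to [0, 1] to 0 (< 0.5) or 1 (>= 0.5)."""
    return 1 if value >= 0.5 else 0


def odds_ratio(n11: float, n00: float, n10: float, n01: float) -> float:
    """Step 3: standard odds ratio of a 2x2 table, (n11 * n00) / (n10 * n01).

    Counts are indexed as (input, output). Assumption (not from the chapter):
    if any cell is zero, 0.5 is added to every cell so the ratio stays finite.
    """
    if 0 in (n11, n00, n10, n01):
        n11, n00, n10, n01 = (c + 0.5 for c in (n11, n00, n10, n01))
    return (n11 * n00) / (n10 * n01)


def rank_by_odds_ratio(
    predictors: Dict[str, Sequence[float]],
    target: Sequence[float],
) -> List[Tuple[str, float]]:
    """Return (predictor name, OR) pairs in the order the predictors appear."""
    ranking = []
    for name, values in predictors.items():
        # Step 2: count the (input, output) combinations for this predictor.
        counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
        for x, y in zip(values, target):
            counts[(dichotomize(x), dichotomize(y))] += 1
        value = odds_ratio(
            counts[(1, 1)], counts[(0, 0)], counts[(1, 0)], counts[(0, 1)]
        )
        # Step 4: report the OR next to the predictor's name.
        ranking.append((name, value))
    return ranking


if __name__ == "__main__":
    # Toy data invented for illustration; values are already scaled to [0, 1].
    toy_predictors = {
        "a": [0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.3, 0.4],
        "b": [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.6, 0.4],
    }
    toy_target = [1, 1, 1, 0, 0, 0, 0, 1]
    for name, value in rank_by_odds_ratio(toy_predictors, toy_target):
        print(name, value)
```

In this toy example, predictor "a" agrees with the class on six of the eight observations and so receives the larger OR, which under the ranking rule described above makes it the more important of the two.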

Classification is a data mining technique concerned with developing models that accurately distinguish the class of one data object from another (Rani, Rao, & Lakshmi, 2014). The developed model is then used to predict the class of objects whose class is not known. Being a form of supervised learning, classification learning algorithms usually operate with predefined class labels (Mastrogiannis, Boutsinas, & Giannikos, 2009). Apart from predicting the class label of data objects, this technique is also used to predict missing data values in a given dataset. The performance of a classification model is tested by evaluating its classification accuracy, which is the ratio of correctly predicted classes to the total number of observations. One of the methods of evaluating classification accuracy is k-fold cross-validation, also referred to as re-sampling (Anguita, Ghelardoni, Ghio, Oneto, & Ridella, 2012). This method divides the dataset into k subsets, uses k-1 subsets to train the classifier, and then tests its performance on the remaining subset. The process is repeated, with a different subset held out at every iteration, until accuracy has been evaluated on all vectors (Anguita et al., 2012; Jung & Hu, 2015). Variable selection has been identified as an important step towards constructing classification models that achieve higher accuracy (Kaushik, 2016). Including variables with little or no modeling value in machine learning negatively affects the predictive power of classifiers. The algorithm to be developed in this research is expected to offer a good alternative to existing filter methods of variable selection.
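
The accuracy ratio and the k-fold procedure described above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the evaluation code used in the chapter: the names accuracy, k_fold_accuracy, and threshold_rule are hypothetical, the data are shuffled once before being split into folds, and threshold_rule is a trivial stand-in for a real classifier such as logistic regression.

```python
import random
from typing import Callable, List, Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Ratio of correctly predicted classes to the total number of observations."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def k_fold_accuracy(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
    fit_predict: Callable[[list, list, list], List[int]],
    seed: int = 0,
) -> float:
    """Train on k-1 folds, test on the held-out fold, and repeat for every fold."""
    if not 2 <= k <= len(rows):
        raise ValueError("k must be between 2 and the number of observations")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # assumption: shuffle once up front
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        predictions = fit_predict(
            [rows[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [rows[i] for i in test_idx],
        )
        correct += sum(1 for p, i in zip(predictions, test_idx) if p == labels[i])
    # Every observation is tested exactly once, so this is the overall accuracy.
    return correct / len(rows)


def threshold_rule(train_rows: list, train_labels: list, test_rows: list) -> List[int]:
    """Hypothetical stand-in classifier: predict 1 when the first feature >= 0.5.

    It ignores the training data; a real model would be fitted on it here.
    """
    return [1 if row[0] >= 0.5 else 0 for row in test_rows]


if __name__ == "__main__":
    # Toy data invented for illustration.
    toy_rows = [[0.9], [0.1], [0.8], [0.2], [0.7], [0.3]]
    toy_labels = [1, 0, 1, 0, 1, 0]
    print(k_fold_accuracy(toy_rows, toy_labels, k=3, fit_predict=threshold_rule))
```

Passing the classifier in as a function keeps the sketch independent of any particular modeling tool, in line with the filter approach, where feature ranking and model training are separate steps.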

Key Terms in this Chapter

Classification: In data mining, classification is a supervised learning activity concerned with developing models that can accurately predict the class labels of vectors whose classes are unknown.

Odds Ratio: Statistical measure of the strength of association between two binary variables, where one variable is dependent on the other.

Big Data: Very large and complex datasets that cannot be manipulated by traditional data processing methods.

K-fold Validation: One of the methods of testing classification accuracy, where the dataset is split into k subsets and, in each iteration, k-1 of the subsets are used for model training while one subset is retained for testing model performance.

Data Mining: The process of extracting hidden but useful information from data sources.

Filter Variable Selection: The process of using statistical techniques, independent of a machine learning algorithm, to evaluate the correlation between each predictor and the outcome variable.

Classification Accuracy: A performance measure of the ability of a classifier to predict classes of unknown vectors.

Feature Selection: The process of assigning a numeric value, or some other form of quantifier, to individual predictors in a dataset, indicating the level of their importance in predicting the outcome.
