Introduction to Feature and Gene Selection

DOI: 10.4018/978-1-60960-557-5.ch008

Problem Of Feature Selection

The times when features were scarce in many branches of science and technology are now in the past. Feature abundance might seem a blessing but, to the surprise of many, it is not. One might think that having more features brings more discriminating power and thus facilitates classification. In practice, however, it causes problems: redundant and irrelevant features increase the complexity of classification and degrade classification accuracy.

Hence, some features have to be removed from the original feature set before a classifier is applied, in order to mitigate these negative effects. The task of removing redundant or irrelevant features is termed feature selection in the machine learning and data mining literature. It is a form of data dimensionality reduction in which the original set of features is reduced to another set that is a subset of it. Here 'subset' is meant in the sense of ⊆ ('subset of or equal to'), which implies that in certain cases a feature set may be irreducible. For microarray data, however, this is surely not the case, since each dataset contains thousands or tens of thousands of features (gene expression levels). In analyzing high dimensional microarray data, one is interested in retaining just a few genes out of many thousands. Gene selection thus becomes synonymous with feature selection, which means that many existing feature selection methods can be readily applied to gene selection.

There is huge interest in gene selection among researchers working in bioinformatics (see, e.g., the links to collected journal articles at http://www.nslij-genetics.org/microarray/). This is because gene selection represents a challenging and important task for both biology and machine learning.

By selecting a small fraction of genes from a microarray, one aims at finding the genes that can be used as indicators of a certain disease or even early predictors of that disease. Since different types of cancer threaten humankind with great persistence, the overwhelming majority of articles about gene selection apply theoretical ideas and methods to a very practical problem related to cancer.

From the machine learning point of view, feature selection removes meaningless genes, i.e. genes not related to the studied disease, thus mitigating overfitting of a classifier on high dimensional microarray data. Overfitting is a plague when there are many features and only a few samples or instances characterized by these features. It leads to very good and often perfect classification performance (zero or close to zero error rate) on the training data, but this seemingly wonderful result does not automatically translate to new, out-of-sample data. To put it differently, a researcher who neglects the harmful effect of overfitting on microarray data may find a small set of genes that he or she claims predicts a certain type of cancer. However, when biologists and/or doctors examine the expression levels of these genes in test volunteers and/or real patients, they see no value in those genes, because during the machine learning stage healthy and diseased patients were separated purely on the basis of noise present in the microarray measurements rather than on the presence or absence of the disease. This happens because, without prior removal of irrelevant genes, the classification problem is an instance of what is known in statistics and machine learning as the small sample size problem (the number of features far exceeds the number of samples in a dataset). For such problems, the lack of classifier generalization to new data is the norm rather than the exception, unless the data dimensionality is dramatically reduced.
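
The effect described above can be reproduced on purely synthetic data. The following is a minimal sketch, not a method taken from this chapter: it assumes Gaussian noise in place of real expression levels, arbitrary alternating class labels, a difference-of-class-means score for picking genes, and a nearest-centroid classifier. All function names, sample sizes, and the choice of 20 genes are hypothetical and chosen only for illustration.

```python
import random

def make_noise_data(n_samples, n_genes, rng):
    """Pure-noise 'expression' matrix with balanced, arbitrary class labels."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # labels carry no information about X
    return X, y

def class_means(X, y, genes):
    """Per-class mean expression for the given gene indices."""
    means = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        means[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    return means

def select_genes(X, y, k):
    """Keep the k genes with the largest absolute difference in class means."""
    all_genes = list(range(len(X[0])))
    m = class_means(X, y, all_genes)
    score = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(all_genes, key=lambda g: score[g], reverse=True)[:k]

def nearest_centroid_accuracy(X_train, y_train, X_eval, y_eval, genes):
    """Fit class centroids on the training data, then score X_eval."""
    centroids = class_means(X_train, y_train, genes)
    correct = 0
    for x, t in zip(X_eval, y_eval):
        dist = {
            c: sum((x[g] - mu) ** 2 for g, mu in zip(genes, centroids[c]))
            for c in (0, 1)
        }
        correct += min(dist, key=dist.get) == t
    return correct / len(y_eval)

rng = random.Random(0)
# Small sample size problem: 20 samples, 2000 "genes" that are all noise.
X_train, y_train = make_noise_data(20, 2000, rng)
X_test, y_test = make_noise_data(200, 2000, rng)

genes = select_genes(X_train, y_train, k=20)
train_acc = nearest_centroid_accuracy(X_train, y_train, X_train, y_train, genes)
test_acc = nearest_centroid_accuracy(X_train, y_train, X_test, y_test, genes)
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because the selected genes are simply those whose noise happened to line up with the training labels, the training accuracy should be close to perfect while the accuracy on fresh samples stays near chance level.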

Thus, the goal in microarray data classification is to identify the differentially expressed genes that can be used to predict the class membership of new, unseen samples. The classification of gene expression data involves feature selection and classifier design. Feature selection identifies the subset of differentially expressed genes that are good (useful, relevant) for distinguishing different classes of samples.
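
One simple way to make "differentially expressed" concrete is to score each gene by how strongly its expression differs between two classes. The sketch below ranks genes by the absolute value of Welch's two-sample t-statistic. This is only one common choice and is not claimed to be the method used in this chapter. The gene names, expression values, and labels are invented toy data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t-statistic for two lists of expression values."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_genes(expression, labels):
    """Return gene names sorted by |t|, most differentially expressed first."""
    scores = {}
    for gene, values in expression.items():
        group0 = [v for v, t in zip(values, labels) if t == 0]
        group1 = [v for v, t in zip(values, labels) if t == 1]
        scores[gene] = abs(welch_t(group0, group1))
    return sorted(scores, key=scores.get, reverse=True)

# Toy data: three samples per class (0 = healthy, 1 = diseased).
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "gene_a": [1.0, 1.2, 0.9, 1.0, 1.1, 1.0],  # same level in both classes
    "gene_b": [0.5, 0.7, 0.6, 2.9, 3.1, 3.0],  # clearly higher in class 1
    "gene_c": [2.0, 1.0, 3.0, 2.5, 1.5, 3.5],  # small shift, large spread
}

ranking = rank_genes(expression, labels)
print(ranking)  # gene_b comes first
```

Keeping only the top-ranked genes from such a list is a filter-style selection. With very few samples per class, as in the toy data here, the ranking itself is noisy and should be validated on data not used for the selection.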
