Redundancy of Features
A typical microarray dataset suffers from a shortage of samples while having features in abundance. Clearly, not all measured features are good predictors of a disease. Since many of the features are redundant, there should be a way to detect them. Yu (2008) argues that wrapper models for gene selection, which use a classifier to judge the goodness of a gene subset, are computationally very expensive and always carry a selection bias due to the learning algorithm used. Yu suggested applying filter models for gene selection instead, since they tend to be much faster than wrappers.
However, many filter models treat each gene separately from the other genes. The outcome of a typical filter is a ranked list of genes ordered according to a certain measure of relevance to the target class. From this list, only a few top genes are then selected. Such an approach has linear time complexity with respect to the total number of genes, but it cannot remove all redundant genes: for example, two highly ranked genes may duplicate each other because they are highly correlated (Ding & Peng, 2003, 2005; Peng, Long, & Ding, 2005)1. As a result, selecting both genes brings no gain in performance over selecting just one of them, which sounds pretty frustrating, right?
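The redundancy problem described above can be reproduced on synthetic data. The following is only a minimal sketch under stated assumptions: the data are simulated, the relevance measure is a simple t-like score (class-mean gap divided by pooled standard deviation) chosen for illustration, and the names `rank_genes`, `gene_a`, `gene_b`, and `gene_c` are hypothetical. None of this is taken from Yu (2008) or from Ding and Peng.

```python
import numpy as np

def rank_genes(X, y):
    """Score every gene independently of the others and rank by score.

    The score is a t-like statistic: the absolute gap between class means
    divided by the pooled standard deviation (an illustrative choice).
    """
    X0, X1 = X[y == 0], X[y == 1]
    mean_gap = np.abs(X1.mean(axis=0) - X0.mean(axis=0))
    pooled_std = np.sqrt((X0.var(axis=0, ddof=1) + X1.var(axis=0, ddof=1)) / 2.0)
    scores = mean_gap / pooled_std
    return np.argsort(scores)[::-1], scores  # best gene first

rng = np.random.default_rng(0)
n_per_class, n_noise = 20, 50
y = np.repeat([0, 1], n_per_class)  # few samples, many genes

# Assumed toy genes: gene_a is strongly relevant, gene_b is a near copy of
# gene_a (redundant), gene_c is weakly relevant but independent of gene_a.
gene_a = 3.0 * y + rng.normal(size=y.size)
gene_b = gene_a + rng.normal(scale=0.05, size=y.size)
gene_c = 1.0 * y + rng.normal(size=y.size)
noise = rng.normal(size=(y.size, n_noise))  # irrelevant genes
X = np.column_stack([gene_a, gene_b, gene_c, noise])

ranking, scores = rank_genes(X, y)
top2 = ranking[:2]
corr_top2 = np.corrcoef(X[:, top2[0]], X[:, top2[1]])[0, 1]

print("top-2 gene indices:", top2.tolist())
print("correlation between the two selected genes:", round(float(corr_top2), 3))
```

Because each gene is scored in isolation, the near copy is ranked right next to the original, and the two selected genes are almost perfectly correlated. The second gene therefore adds essentially no new information, while the independent gene (column 2) is left out.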
The keen reader might also notice a hidden problem with filters: how does one determine the threshold separating relevant and irrelevant genes in the ranked list? Sad to say, it is often chosen heuristically, or its exact value can only be found with the help of domain-specific information, which might not always be available (no biologist, no chemist, no medical doctor nearby, and your own knowledge of machine learning, data mining, computer science, electrical engineering, mathematics, physics, or whatever field is relevant to you unfortunately stops here). But there is nothing to be done: this is the price to pay if we are in a hurry and have refused the help of a learning algorithm in this tough work. Thus, perhaps faced with similar questions and problems, Yu decided that an automatic gene selection method is absolutely necessary, one that would satisfy the following requirements: it belongs to the filter model for feature selection; it does not require any threshold for gene selection to be specified; and it removes both irrelevant and redundant genes, thus delivering a small set of biologically relevant genes. Nice intentions! Let us see how Yu managed to accomplish them.