Molecular classification involves the classification of samples into groups of biological phenotypes. Studies on molecular classification generally focus on cancer because classifying tumor samples from patients into different molecular types or subtypes is vital for the diagnosis, prognosis, and effective treatment of cancer (Slonim, Tamayo, Mesirov, Golub, and Lander, 2000). Traditionally, such classification relies on observations regarding the location (Slonim et al., 2000) and microscopic appearance of the cancerous cells (Garber et al., 2001). These methods have proven to be slow and ineffective: they cannot reliably predict the progress of the disease, since tumors of similar appearance have been known to follow different courses over time.
With the advent of microarray technology, data on the gene expression levels in each tumor sample may now prove to be a useful tool in molecular classification. This is because gene expression data provide snapshots of the activities within the cells and thus a profile of the state of the cells in the tissue. The use of microarrays for gene expression profiling was first reported in 1995 (Schena, Shalon, Davis, and Brown, 1995). In a typical microarray experiment, the expression levels of 10,000 or more genes are measured in each sample. The high dimensionality of the data means that feature selection (FS) plays a crucial role in aiding classification by reducing the dimensionality of the classifier's input. In the context of gene expression data, the terms gene and feature will be used interchangeably.
The objective of FS is to find, from an overall set of N features, the subset of features, S, that gives the best classification accuracy. This feature subset is also known as the predictor set. There are two major types of FS techniques: filter-based and wrapper techniques. Filter-based techniques have several advantages over wrapper techniques:
Filter-based techniques are computationally less expensive than wrapper techniques.
Filter-based techniques are not classifier-specific; they can be used with any classifier of choice to predict the class of a new sample, whereas with wrapper techniques, the same classifier that was used to form the predictor set must also be used to predict the class of a new sample. For instance, if a GA/SVM (wrapper) technique is used to form the predictor set, the SVM classifier (with the same classifier parameters, e.g., the same type of kernel) must then be used to predict the class of a new sample.
More importantly, unlike wrapper techniques, which typically behave as ‘black boxes’, filter-based techniques provide a clear picture of why a certain feature subset is chosen as the predictor set, because they use scoring methods in which the inherent characteristics of the predictor set (and not just its prediction ability) are optimized.
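The classifier independence described in the points above can be illustrated with a minimal sketch. It assumes a two-class problem and a signal-to-noise style score (difference of class means over the sum of class standard deviations); the function names, the toy data, and the two simple classifiers are hypothetical choices made for illustration, not methods prescribed by this text.

```python
import numpy as np

def signal_to_noise(X, y):
    """Per-gene score for two classes: |mean0 - mean1| / (std0 + std1).
    Assumed scoring function; any classifier-free score could be used."""
    X0, X1 = X[y == 0], X[y == 1]
    diff = np.abs(X0.mean(axis=0) - X1.mean(axis=0))
    return diff / (X0.std(axis=0) + X1.std(axis=0) + 1e-12)

def select_top_k(X, y, k):
    """Filter step: rank genes by score and keep the k best.
    No classifier is consulted at any point."""
    return np.argsort(signal_to_noise(X, y))[::-1][:k]

def nearest_centroid_predict(X_train, y_train, X_new):
    """Assign each new sample to the class with the closest mean profile."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def one_nn_predict(X_train, y_train, X_new):
    """Assign each new sample the label of its nearest training sample."""
    d = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

# Toy expression matrix: 40 samples x 200 genes, genes 0..4 made informative.
rng = np.random.default_rng(0)
n_per_class, n_genes = 20, 200
y = np.repeat([0, 1], n_per_class)
X = rng.normal(size=(2 * n_per_class, n_genes))
X[y == 1, :5] += 4.0

# The predictor set is formed once, without reference to any classifier ...
predictor_set = select_top_k(X, y, k=5)
X_sel = X[:, predictor_set]

# ... and can then be handed to different classifiers unchanged.
pred_nc = nearest_centroid_predict(X_sel, y, X_sel)
pred_nn = one_nn_predict(X_sel, y, X_sel)
```

In a wrapper technique, by contrast, the selection step itself would call the classifier, so the resulting predictor set would be tied to that classifier and its parameters. The predictions here are made on the training samples only to keep the sketch short; they are not an estimate of accuracy on new samples.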
Currently, filter-based FS techniques can be grouped into two categories: rank-based selection (Dudoit, Fridlyand, and Speed, 2002; Golub et al., 1999; Slonim et al., 2000; Su, Murali, Pavlovic, Schaffer, and Kasif, 2003; Takahashi & Honda, 2006; Tusher, Tibshirani, and Chu, 2001) and state-of-the-art equal-priorities scoring methods (Ding & Peng, 2005; Hall & Smith, 1998; Yu & Liu, 2004). This categorization is closely related to the two existing criteria used in filter-based FS techniques. The first criterion, relevance, indicates the ability of a gene to distinguish among samples of different classes. The second criterion, redundancy, indicates the similarity between pairs of genes in the predictor set. The aim of FS is to maximize the relevance of the genes in the predictor set and to minimize the redundancy between genes in the predictor set.
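To make the two criteria concrete, the following sketch computes one possible relevance score and one possible redundancy score for a candidate predictor set. The text above does not fix particular measures, so both are assumptions: relevance is taken as the ratio of between-class to within-class sum of squares for each gene, and redundancy as the mean absolute Pearson correlation over all gene pairs in the set. All function names and the toy data are hypothetical.

```python
import numpy as np

def relevance_bss_wss(X, y):
    """Per-gene relevance (assumed measure): between-class sum of squares
    divided by within-class sum of squares. Higher means the gene's
    class means are better separated relative to its spread."""
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        bss += Xc.shape[0] * (mc - overall) ** 2
        wss += ((Xc - mc) ** 2).sum(axis=0)
    return bss / (wss + 1e-12)

def set_redundancy(X, S):
    """Redundancy of predictor set S (assumed measure): mean absolute
    Pearson correlation over all distinct pairs of genes in S."""
    if len(S) < 2:
        return 0.0  # a single gene has no pairs to compare
    R = np.abs(np.corrcoef(X[:, S], rowvar=False))
    upper = np.triu_indices(len(S), k=1)
    return float(R[upper].mean())

# Toy data: 90 samples, 3 classes, 4 genes.
rng = np.random.default_rng(1)
n = 30
y = np.repeat([0, 1, 2], n)
X = rng.normal(size=(3 * n, 4))
X[y != 0, 0] += 3.0                                # gene 0 separates class 0
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=3 * n)   # gene 1 is a near-copy of gene 0
X[y == 2, 2] += 3.0                                # gene 2 separates class 2
# gene 3 is left as pure noise

rel = relevance_bss_wss(X, y)

# Two candidate predictor sets, both built from relevant genes.
S_a, S_b = [0, 1], [0, 2]
red_a = set_redundancy(X, S_a)
red_b = set_redundancy(X, S_b)
print("relevance per gene:", np.round(rel, 3))
print("redundancy of S_a:", round(red_a, 3), "| redundancy of S_b:", round(red_b, 3))
```

Both candidate sets contain individually relevant genes, but S_a pairs a gene with its near-duplicate, so it adds little new information. Under the stated aim, a set such as S_b, with comparable relevance and lower redundancy, would be preferred.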