Feature Selection Algorithms for Classification and Clustering in Bioinformatics

Sujata Dash (Gandhi Institute for Technology, India) and Bichitrananda Patra (KMBB College of Engineering and Technology, India)
DOI: 10.4018/978-1-5225-1759-7.ch085

This chapter discusses several important issues: pre-processing of gene expression data, the curse of dimensionality, feature extraction/selection, and measuring or estimating classifier performance. Although these concepts are relatively well understood among technically trained people such as statisticians, electrical engineers, and computer scientists, they are relatively new to biologists and bioinformaticians. As such, some misconceptions about the use of classification methods persist. For instance, in most classifier design strategies, gene or feature selection is an integral part of the classifier, and as such it must be included inside the cross-validation process used to estimate the classifier's prediction performance. Simon (2003) discussed several studies that appeared in prestigious journals where this important issue was overlooked and optimistically biased prediction performances were reported. Furthermore, the authors also discuss important properties such as generalizability (sensitivity to overtraining), built-in feature selection, ability to report prediction strength, and transparency of different approaches, to provide a quick and concise reference. Classifier design and clustering methods are relatively well established; however, the complexity of the problems rooted in microarray technology hinders the applicability of classification methods as diagnostic and prognostic predictors or class-discovery tools in medicine.
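The selection-bias point can be made concrete with a small sketch. The code below is a minimal, self-contained illustration (not the chapter's own method): it uses a hypothetical mean-difference filter and a nearest-centroid classifier on synthetic data, and performs the selection step freshly inside every cross-validation fold, which is the unbiased protocol the chapter argues for.

```python
import random

random.seed(0)

# Synthetic data: 40 samples, 200 features; only feature 0 is informative.
n, p = 40, 200
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [i % 2 for i in range(n)]
for i in range(n):
    X[i][0] += 2.0 if y[i] == 1 else -2.0  # shift the informative feature

def mean(v):
    return sum(v) / len(v)

def select_top_features(Xs, ys, k):
    """Rank features by absolute difference of class means (a simple filter)."""
    scores = []
    for j in range(len(Xs[0])):
        a = [Xs[i][j] for i in range(len(Xs)) if ys[i] == 0]
        b = [Xs[i][j] for i in range(len(Xs)) if ys[i] == 1]
        scores.append((abs(mean(a) - mean(b)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def nearest_centroid_acc(train_idx, test_idx, feats):
    """Fit per-class centroids on the training fold; score the held-out fold."""
    cents = {}
    for c in (0, 1):
        rows = [i for i in train_idx if y[i] == c]
        cents[c] = [mean([X[i][j] for i in rows]) for j in feats]
    correct = 0
    for i in test_idx:
        d = {c: sum((X[i][j] - cents[c][t]) ** 2 for t, j in enumerate(feats))
             for c in (0, 1)}
        correct += (min(d, key=d.get) == y[i])
    return correct / len(test_idx)

# Correct protocol: re-select features inside every fold, never on the full data.
folds = [list(range(f, n, 5)) for f in range(5)]
accs = []
for fold in folds:
    train = [i for i in range(n) if i not in fold]
    feats = select_top_features([X[i] for i in train],
                                [y[i] for i in train], k=5)
    accs.append(nearest_centroid_acc(train, fold, feats))
print("cross-validated accuracy: %.2f" % (sum(accs) / len(accs)))
```

Running the filter once on the full dataset and then cross-validating only the classifier would leak test-fold information into the selection step, producing exactly the optimistic bias Simon (2003) describes.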
Chapter Preview

2. Feature Selection Techniques

As many pattern recognition techniques were not originally designed to cope with large numbers of irrelevant features, combining them with FS techniques has become a necessity in many applications (H. Liu and L. Liu, 2005). The objectives of feature selection are manifold, the most important ones being:

  1. To avoid overfitting and improve model performance, i.e., prediction performance in the case of supervised classification and better cluster detection in the case of clustering;
  2. To provide faster and more cost-effective models; and
  3. To gain deeper insight into the underlying processes that generated the data.

However, the advantages of feature selection techniques come at a certain price, as the search for a subset of relevant features introduces an additional layer of complexity into the modelling task. Instead of just optimizing the parameters of the model for the full feature set, we now need to find the optimal model parameters for the optimal feature subset, as there is no guarantee that the optimal parameters for the full feature set are equally optimal for the optimal feature subset (W. Daelemans et al., 2003). As a result, the search in the model hypothesis space is augmented by another dimension: that of finding the optimal subset of relevant features. Feature selection techniques differ from each other in how they incorporate this search over feature subsets into the model-selection process.
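The enlarged search space can be sketched in a few lines. The example below is a toy wrapper-style search, with a hypothetical `cv_score` function standing in for cross-validated performance: the candidate space is the Cartesian product of feature subsets and a model parameter (here a made-up smoothing value), and both are optimized jointly rather than the parameter alone.

```python
from itertools import combinations

# Hypothetical stand-in for cross-validated performance: rewards
# subsets containing the "informative" features 1 and 3, penalizes
# extra features, and prefers smoothing near 0.5.
def cv_score(features, smoothing):
    base = len(set(features) & {1, 3}) - 0.2 * len(set(features) - {1, 3})
    return base - abs(smoothing - 0.5)

all_features = range(5)

# Joint search over (feature subset, parameter) pairs -- the augmented
# hypothesis space the text describes.
best = max(
    ((subset, s)
     for r in range(1, 4)
     for subset in combinations(all_features, r)
     for s in (0.1, 0.5, 1.0)),
    key=lambda cand: cv_score(*cand),
)
print(best)  # ((1, 3), 0.5)
```

Even in this toy setting the candidate count is (number of subsets) × (number of parameter values), which is why practical feature selection methods replace exhaustive enumeration with filters, greedy wrappers, or embedded selection.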
