1. Introduction
Several advanced genomic technologies developed in recent years (DNA microarrays, NGS, RNA-seq…), especially during the sequencing of the human genome, have proven very helpful for molecular diagnostics, have unveiled new insights into biology, and have led to biomarker discovery (Mabert et al., 2014). The use of molecular biomarkers is expected to affect different areas of clinical practice, to provide valuable additional information for tumor diagnosis and prognosis, and ultimately to contribute to personalized cancer therapy. The ideal biomarker for cancer would have applications in (a) classification of tumors, (b) prognosis of disease progression, (c) prediction of response to therapy, and (d) monitoring of response to therapy, and would also serve as a target for drug development (Stoss & Henkel, 2004).
Gene expression microarrays are used to survey and measure gene activity in healthy and diseased tissues across various populations. They can measure and record the expression levels of thousands of genes simultaneously in different sample types and under specific experimental conditions (each referred to as a sample) (Bolon-Canedo et al., 2014). In cancer research, these technologies have been broadly investigated for classifying different types of tumors, and they make accurate cancer prediction possible and easier with bioinformatics tools from machine learning and pattern recognition (Wu et al., 2012).
As a general observation, several problems are studied with gene expression microarrays (GEM). They can be divided into three classes: class prediction, which uses supervised machine learning approaches; class discovery, which uses unsupervised machine learning approaches (Banu & Andrews, 2015); and gene comparison, which uses machine learning approaches in general (Golub et al., 1999). The direct application of these methods to high-dimensional data is usually ineffective (Wu et al., 2012), because gene expression data consist of a large number of features (genes) and a small number of samples. Moreover, many of these genes are irrelevant, redundant, or noisy; only a small set of genes carries useful biological information and ultimately yields high accuracy for cancer diagnosis. In addition, the presence of many features affects not only prediction performance but also the computational time of learning algorithms (Bolon-Canedo et al., 2014).
To avoid the curse of dimensionality, it becomes necessary to select a small subset of features (genes) that can separate healthy patients from cancer patients or, in more general terms, genes that are relevant, non-redundant, and discriminative for a particular genetic disease. These genes are called biomarkers, informative genes, parsimonious genes, or differentially expressed genes.
Therefore, we require dimensionality reduction techniques that identify a small set of genes carrying the most discriminant information of the original ensemble of genes, in order to achieve better learning performance. This step plays a central role in machine learning, and more specifically in the classification task, and offers several advantages (Krishnapuram et al., 2004): (a) it reduces the computational cost and storage space of the classification model, which is built using only a small subset of the original set of genes; (b) it significantly improves the intelligibility of the classifier and maximizes the prediction performance of the classification algorithm; and (c) it reduces the risk of "overfitting" when the number of samples is small. Consequently, classifier predictions are more reliable and robust, and can help doctors choose an appropriate treatment that gives patients a better response to therapy, especially when the disease is identified at an early stage (Osl et al., 2012).
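To make the idea of selecting a small discriminative gene subset concrete, the following is a minimal sketch of one common filter-style approach: ranking genes by the absolute Welch t-statistic between two classes and keeping the top k. It is an illustration under stated assumptions, not the method proposed in this article. The function names (welch_t, select_top_genes, make_toy_data), the synthetic data, and the choice of the t-statistic as the relevance score are all hypothetical and chosen only for the example.

```python
import random
import statistics


def welch_t(a, b):
    """Welch t-statistic between two lists of expression values."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    denom = (var_a / len(a) + var_b / len(b)) ** 0.5
    if denom == 0.0:
        # Both groups are constant: treat the gene as uninformative.
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / denom


def select_top_genes(samples, labels, k):
    """Return the indices of the k genes with the largest |t| between classes 0 and 1.

    samples: list of rows, one row of expression values per sample.
    labels:  list of 0/1 class labels (assumed here: 0 = healthy, 1 = tumor).
    """
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        healthy = [row[g] for row, y in zip(samples, labels) if y == 0]
        tumor = [row[g] for row, y in zip(samples, labels) if y == 1]
        scores.append(abs(welch_t(healthy, tumor)))
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:k]


def make_toy_data(n_per_class=10, n_genes=200, informative=(3, 17, 42),
                  shift=6.0, seed=0):
    """Synthetic data: few samples, many genes, only a handful informative."""
    rng = random.Random(seed)
    samples, labels = [], []
    for y in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if y == 1:
                # Shift only the informative genes in the tumor class.
                for g in informative:
                    row[g] += shift
            samples.append(row)
            labels.append(y)
    return samples, labels


samples, labels = make_toy_data()
top_genes = select_top_genes(samples, labels, k=3)
print(sorted(top_genes))
```

The toy data mimic the setting described above (20 samples, 200 genes, most of them noise). A univariate filter such as this one scores each gene independently, so it addresses relevance but not redundancy; removing redundant genes would require an additional step that is not shown here.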