1. Introduction
Identifying the function of G-protein coupled receptors (GPCRs) is an area of active interest in pharmaceutical and biological research. Of the approximately 500 clinically marketed drugs, more than 30% are modulators of GPCR function, making GPCRs the most successful target class in drug discovery (Drews, 2000).
Intense efforts have been devoted to identifying the functions of orphan GPCRs. However, for many GPCRs, such efforts have failed to yield reliable results.
At this stage, several questions arise: What are the necessary steps for reliable protein function identification? Which protein representation method (PRM) is adequate for extracting features and constructing numerical attribute vectors? Which data mining algorithm (DMA) should be selected to obtain an accurate classification? How can the combinatorial explosion that classification algorithms face, owing to the complex nature of protein data, be avoided?
Although many GPCR function prediction approaches have been proposed, a great number of GPCRs are still orphans. The most common earlier methodology is sequence similarity searching in protein databases, which is mainly based on pairwise sequence alignment tools such as BLAST (Zhang et al., 2012). However, it is difficult to identify GPCRs successfully in this way because many of them share no significant sequence similarity. Moreover, two proteins can have very different sequences and perform a similar function, or have very similar sequences and perform different functions (Nemati et al., 2009). To address this problem, several statistical and machine learning approaches have been developed (Secker et al., 2007).
There are three major problems in computational protein function prediction with classification algorithms: the choice of the classification algorithm, the choice of the PRM, and the selection of relevant attributes to avoid the combinatorial explosion problem. These remain open problems, as in classification generally, because there are many options and it is not clear which one is best.
Generally, there are several strategies for extracting attributes from a protein sequence, and the choice of the PRM may be as important as the choice of the DMA. Nevertheless, some works (King et al., 2001) overlook the feature extraction strategy and focus mainly on which classification algorithm to use. Other researchers have developed a hybrid feature extraction strategy (Rehman & Khan, 2011) that exploits both the pseudo-amino-acid composition (PseAAC) and a multiscale energy representation, while some authors (Secker et al., 2010; Naveed & Khan, 2012) have compared the predictive accuracies of a few PRMs in protein classification.
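As a minimal sketch of what a PRM does, consider plain amino acid composition (AAC), the simplest composition-based representation, which maps a sequence of any length to a fixed vector of 20 relative frequencies. This is an illustration only and not the PseAAC or hybrid strategy of the works cited above; the function name and the toy sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the 20-dimensional relative-frequency vector of a protein sequence.

    Non-standard symbols (e.g. 'X' or gaps) are ignored.
    """
    sequence = sequence.upper()
    counts = Counter(residue for residue in sequence if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy fragment, not a real GPCR sequence.
toy_sequence = "MKTAYIAKQR"
vector = amino_acid_composition(toy_sequence)
```

Because AAC discards residue order, richer PRMs such as PseAAC append order-dependent terms to this vector, which is one reason the attribute count can grow quickly.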
Transforming a protein chain can yield a very large numerical attribute vector, and the size and components of this vector strongly influence the predictive accuracy and error rate of the classification. To improve these rates, it is necessary to eliminate the noise (redundant or useless information) present in the examples to be classified. Furthermore, datasets with hundreds or thousands of attributes may suffer from the curse of dimensionality and combinatorial explosion (Chen et al., 2014).
One of the most feasible techniques for coping with this problem is feature selection (FS) (Sayes et al., 2007; Bagherzadeh-Khiabani et al., 2016), which optimizes the classification model and improves its performance measures. This technique is widely used to improve results in different fields, such as protein function prediction (Nemati et al., 2009), and it is especially common in big data and data mining (Li & Liu, 2017; Tupe & Wakchaure, 2017).
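To make the idea concrete, a filter-style FS step can be sketched as follows, assuming a small labelled dataset stored as a list of attribute vectors. The Fisher-like score (between-class over within-class variance) is one common filter criterion chosen here for illustration; it is not necessarily the criterion used in the cited works, and the function names and toy data are hypothetical.

```python
from statistics import mean, pvariance


def fisher_scores(X, y):
    """Score each attribute by between-class variance over within-class variance."""
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall = mean(column)
        between = within = 0.0
        for c in classes:
            values = [row[j] for row, label in zip(X, y) if label == c]
            between += len(values) * (mean(values) - overall) ** 2
            within += len(values) * pvariance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # No spread inside any class: perfectly separating or constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Keep the k highest-scoring attributes; return their indices and the reduced data."""
    scores = fisher_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    kept = sorted(ranked[:k])
    return kept, [[row[j] for j in kept] for row in X]


# Toy data: attribute 0 separates the classes, 1 barely does, 2 does not.
X = [
    [0.10, 0.50, 0.30],
    [0.12, 0.48, 0.90],
    [0.80, 0.52, 0.35],
    [0.85, 0.50, 0.85],
]
y = [0, 0, 1, 1]

kept, X_reduced = select_top_k(X, y, k=1)
```

A filter of this kind scores each attribute independently of the classifier, which keeps it cheap on vectors with thousands of attributes; wrapper and embedded methods trade that speed for a selection tuned to a specific DMA.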