Bio-Inspired Data Mining for Optimizing GPCR Function Identification

Safia Bekhouche, Yamina Mohamed Ben Ali
DOI: 10.4018/IJCINI.20211001.oa40

Abstract

GPCRs are the largest family of cell surface receptors, and many of them remain orphans. Predicting GPCR function is an important bioinformatics task: it consists of assigning each protein to its corresponding functional class. This classification step requires a good protein representation method and a robust classification algorithm. However, the complexity of the task is increased by the large number of GPCR features in most databases, which produces a combinatorial explosion. To reduce complexity and optimize classification, the authors propose using bio-inspired metaheuristics both for feature selection and for choosing the best couple (feature extraction strategy (FES), data mining algorithm (DMA)). Specifically, they propose using the BAT algorithm to extract the pertinent features and the Genetic Algorithm to choose the best couple. They compared the results they obtained with two existing algorithms. Experimental results indicate the efficiency of the proposed system.

1. Introduction

The identification of G-protein coupled receptor (GPCR) function is an area of current interest in pharmaceutical and biological research. Of the approximately 500 clinically marketed drugs, more than 30% are modulators of GPCR function, making GPCRs the most successful target class in drug discovery (Drews, 2000).

Intense efforts have been devoted to identifying new GPCR functions for orphans. However, for many GPCRs, such efforts have failed to yield reliable results.

At this stage, several questions arise. What are the necessary steps for good protein function identification? Which protein representation method (PRM) is adequate for extracting features and constructing numerical attribute vectors? Which data mining algorithm (DMA) should be selected to make an accurate classification? How can the combinatorial explosion of classification algorithms, caused by the complex nature of protein data, be avoided?

Although many GPCR function prediction approaches have been proposed, a great number of GPCRs are still orphans. The most common earlier methodology is sequence similarity searching in protein databases, which is mainly based on pairwise sequence alignment tools such as BLAST (Zhang et al., 2012). However, it is difficult to identify GPCRs successfully this way because they share no significant sequence similarities. Moreover, two proteins can have very different sequences and perform a similar function, or have very similar sequences and perform different functions (Nemati et al., 2009). To solve this problem, some statistical and machine learning approaches have been developed (Secker et al., 2007).

There are three major problems in computational protein function prediction with classification algorithms: the choice of the classification algorithm, the choice of the PRM, and the selection of relevant attributes to avoid the combinatorial explosion problem. These are open problems in any classification task, since there are many choices and it is not clear which one is best.

Generally, there are several strategies to extract attributes from a protein sequence, and the choice of the PRM might be as important as the choice of the DMA. Some works (King et al., 2001), however, overlook the feature extraction strategy and focus more on which classification algorithm to use. Other researchers have developed a hybrid feature extraction strategy (Rehman & Khan, 2011) that exploits both the pseudo-amino-acid composition (PseAAC) strategy and multiscale energy representation, while some authors (Secker et al., 2010; Naveed & Khan, 2012) have compared the predictive accuracies of a few PRMs in protein classification.

The transformation of the protein chain can give an enormous numerical attribute vector, and the size and components of the latter strongly influence the predictive accuracy and the error rate of the classification. To improve these rates, it is necessary to eliminate noise (redundant or useless information) present in the examples to be classified. Furthermore, datasets with hundreds or thousands of attributes may cause the curse of dimensionality and combinatorial explosion problems (Chen et al., 2014).
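To make the growth in vector size concrete, the following is a minimal sketch of one simple family of representations, k-mer composition, assuming the 20 standard amino acids. It is an illustration only and not necessarily one of the PRMs used by the authors; the function name `kmer_composition` and the toy sequence are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Return the relative frequency of every possible k-mer, in a fixed order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence and count each k-mer.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows containing non-standard residues
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


toy_sequence = "MKTAYIAKQR"  # made-up fragment, not a real GPCR
aac = kmer_composition(toy_sequence, 1)  # amino acid composition
dpc = kmer_composition(toy_sequence, 2)  # dipeptide composition
print(len(aac), len(dpc))  # 20 400
```

Moving from single residues to residue pairs multiplies the vector length by 20, and most of the 400 dipeptide entries are zero for a short sequence, which is the kind of sparse, high-dimensional input that motivates attribute reduction.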

One of the most feasible techniques to cope with this problem is feature selection (FS) (Sayes et al., 2007; Bagherzadeh-Khiabani et al., 2016), which optimizes the classification model and improves the performance measurements. This technique is widely used to improve results in different fields, such as protein function prediction (Nemati et al., 2009), and it is mostly used in big data and data mining (Li & Liu, 2017; Tupe & Wakchaure, 2017).
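As a rough illustration of what feature selection does to an attribute matrix, the sketch below keeps only the columns with the highest variance. This simple filter is a stand-in chosen for brevity; it is not the BAT-based selection the authors propose, and the helper names `select_by_variance` and `project`, as well as the toy data, are hypothetical.

```python
from statistics import pvariance


def select_by_variance(rows, k):
    """Return the indices of the k highest-variance columns, in ascending order."""
    n_cols = len(rows[0])
    variances = [pvariance([row[j] for row in rows]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def project(rows, keep):
    """Keep only the selected columns of every example."""
    return [[row[j] for j in keep] for row in rows]


# Toy attribute matrix: 3 examples, 4 attributes (values are made up).
toy = [
    [0.10, 0.5, 0.0, 0.30],
    [0.40, 0.5, 0.0, 0.31],
    [0.90, 0.5, 0.1, 0.29],
]
keep = select_by_variance(toy, 2)
reduced = project(toy, keep)
print(keep)  # [0, 2]
```

The constant second column carries no information for a classifier and is dropped first. A wrapper method such as a metaheuristic search would instead score candidate subsets by the accuracy of the classifier trained on them.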
