1. Introduction
Proteins play a fundamental role in all living organisms and are involved in a variety of molecular functions and biological processes. They are among the most essential and versatile macromolecules of life, and knowledge of their functions is an essential link in the development of new drugs, better crops, and even synthetic biochemicals such as biofuels (Nemati et al., 2009; Azar 2021). Proteins are composed of one or more chains of amino acids and show several levels of structure. The primary structure is defined by the sequence of amino acids, while the secondary structure is defined by local, repetitive spatial arrangements such as helix, strand, and coil. The tertiary structure is defined by how the chain folds into a three-dimensional configuration, and this 3D structure is uniquely determined by the amino acid sequence (Cao et al., 2006).
Significant progress has been made in protein structure prediction during the past decade (Chen et al., 2009; Ding et al., 2009; Floudas, 2007; Zou et al., 2011), including comparative modeling (Cheng, 2008), fold recognition and threading (Wu and Zhang, 2007; Xu et al., 2008), and first-principles prediction with and without database information (McAllister and Floudas, 2010; Rajgaria et al., 2010; Subramani et al., 2009). The structure of proteins is very complex, but their main structural arrangement is surprisingly simple and regular. In fact, according to their chain folding pattern, proteins are usually grouped into four structural classes: all-α, all-β, α+β, and α/β (Krajewski and Tkacz, 2013a,b).
Traditionally, computational prediction methods use features derived from protein sequence, protein structure, or interaction networks to predict function (Rost et al., 2003; Rentzsch and Orengo, 2009; Koubaa et al., 2020). In this paper, the features are extracted from the protein primary sequence, based on amino acid composition and K-mer patterns, also called K-grams or K-tuples (Bagyamathi and Inbarani, 2015; Chandran, 2008). The K-gram representation of features, used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of K. Unfortunately, for very high-dimensional data with hundreds of thousands of dimensions, model-based approaches that transform data instances into feature vectors are computationally expensive, because in most cases they require inference at runtime. A less expensive approach to dimensionality reduction is feature selection, which reduces the number of features by selecting a subset of the available features based on some chosen criteria (Guyon and Elisseeff, 2003; Fleuret, 2004; Hassanien et al., 2014a, 2015; Azar, 2013, 2014; Banu et al., 2017; Azar and Hassanien, 2015; Azar and El-Said, 2014, 2013a,b).
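To make the K-gram representation and its dimensionality concrete, the following is a minimal Python sketch, not the implementation used in this paper. It assumes the 20 standard amino acids as the alphabet, overlapping K-grams, and frequency-normalized counts. The function names (`kmer_counts`, `kmer_feature_vector`, `select_by_variance`) are hypothetical, and variance is used only as a simple stand-in for "some chosen criteria" in feature selection.

```python
from collections import Counter
from itertools import product

# Assumption: the alphabet is the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_counts(sequence, k):
    """Count overlapping K-grams in a protein sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def kmer_feature_vector(sequence, k):
    """Return relative K-gram frequencies over all 20**k possible K-grams.

    With k = 1 this reduces to the amino acid composition.
    """
    counts = kmer_counts(sequence, k)
    total = sum(counts.values()) or 1  # avoid division by zero for short input
    vocabulary = ("".join(p) for p in product(AMINO_ACIDS, repeat=k))
    return [counts[kmer] / total for kmer in vocabulary]


def select_by_variance(vectors, n_keep):
    """Illustrative feature selection: keep the n_keep highest-variance features.

    Variance is a placeholder criterion chosen for this sketch only.
    Returns the selected feature indices in ascending order.
    """
    n = len(vectors)
    dims = len(vectors[0])
    variances = []
    for j in range(dims):
        column = [v[j] for v in vectors]
        mean = sum(column) / n
        variances.append(sum((x - mean) ** 2 for x in column) / n)
    ranked = sorted(range(dims), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # The feature space grows as 20**k, which motivates feature selection.
    for k in (1, 2, 3, 4):
        print(k, len(AMINO_ACIDS) ** k)
```

Under these assumptions the vector length is 20^K (20, 400, 8,000, 160,000 for K = 1 to 4), which illustrates why large values of K lead to very high-dimensional inputs and why selecting a subset of features is attractive.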