Introduction
In every living cell of the human organism, a variety of protein-protein interactions (PPIs) take place. In recent years, researchers have approached the problem of predicting all possible protein interactions in the human organism with a range of computational techniques. Early methods were mostly based on the analysis of a single feature indicative of interaction between two proteins. Examples include features of the genomic sequences of the genes that encode the proteins in question, features of the proteins' structures, features of the proteins' sequences, and many others (Chua et al., 2006). The most recent computational approaches feed several features to their classifiers in order to take advantage of all the available information (Chen et al., 2005; Fariselli et al., 2002).
Bayesian classifiers (Howson et al., 1993) are the most common computational methods used to integrate data from a wide variety of sources. Scott et al. (2007) presented a hybrid approach that combines a naïve Bayesian classifier with a full Bayesian classifier. Using the full classifier, they produced a combined feature from three features: co-localization, co-occurrence of post-translational modifications, and domain co-occurrence. In a subsequent step, that combined feature was merged with a co-expression feature, an orthology feature, a co-disorder feature and a network topology feature using a Bayesian classifier. Their methodology was applied to predict and rank human PPIs genome-wide. The probabilistic framework offered by Bayesian approaches can produce interpretable classifiers with adequate classification performance, and it allows the incorporation of expert knowledge. On the other hand, its simplicity limits performance, which pushes researchers toward more sophisticated techniques for predicting protein interactions.
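The way a naïve Bayesian classifier integrates several evidence sources can be illustrated with a minimal sketch. It assumes each feature contributes an independent likelihood ratio, P(feature | interacting) / P(feature | non-interacting), and multiplies these ratios onto the prior odds of interaction. The function name, the prior and the likelihood-ratio values are hypothetical and chosen only for illustration; they are not taken from Scott et al. (2007).

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Combine independent evidence with Bayes' rule in odds form.

    posterior odds = prior odds * product of per-feature likelihood ratios.
    The product is accumulated in log space to avoid numerical underflow
    when many features are combined.
    """
    prior_odds = prior / (1.0 - prior)
    log_odds = math.log(prior_odds) + sum(math.log(lr) for lr in likelihood_ratios)
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


# Hypothetical likelihood ratios for one candidate protein pair.
example_ratios = {
    "co_expression": 4.0,
    "orthology": 10.0,
    "co_disorder": 1.5,
    "network_topology": 6.0,
}

# Hypothetical prior: one in a thousand random pairs is assumed to interact.
prior = 0.001
posterior = posterior_probability(prior, example_ratios.values())
print(f"posterior probability of interaction: {posterior:.4f}")
```

Because each feature enters as a separate multiplicative factor, its contribution to the final score can be read off directly, which is one reason such classifiers are considered interpretable. The independence assumption is also the source of the simplicity that limits their performance.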
Two machine learning methods widely used for PPI prediction are Artificial Neural Networks (Haykin, 1998) and Support Vector Machines (SVMs) (Cortes & Vapnik, 1995). Both have been applied to an enormous variety of classification problems, with very high classification performance. Chen et al. (2006) developed an integrative Artificial Neural Network framework to predict PPIs in human from heterogeneous data. They used diverse data sources, such as protein domain data, molecular function data and biological process annotations, to carry out the prediction. Although Artificial Neural Networks have demonstrated very good classification measures on the problem of predicting PPIs, more sophisticated techniques such as SVMs seem to outperform them because of their higher generalization ability. Bock et al. (2001) trained an SVM learning system on interaction data, with protein sequences and associated physicochemical properties as features. For each amino acid sequence of a protein complex, feature vectors were assembled from encoded representations of several tabulated residue properties, such as charge, hydrophobicity and surface tension, for each residue in the sequence. SVMs have been shown to provide high prediction accuracy, but at the expense of increased computational cost (Gomez et al., 2003). Hence, it is usually infeasible to train an SVM on a relatively large collection of training examples. Moreover, the results derived from that method cannot be easily interpreted by biologists.
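The idea of assembling feature vectors from tabulated residue properties can be sketched as follows. This is a simplified illustration, not the encoding of Bock et al. (2001): it uses only two properties, the Kyte-Doolittle hydropathy scale as a stand-in for hydrophobicity and a crude formal charge at neutral pH, and it pads or truncates every sequence to a fixed length so that all vectors have the same dimension. The function names and the fixed-length scheme are assumptions made for this example.

```python
# Kyte-Doolittle hydropathy index per amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified formal charge at neutral pH; residues not listed count as 0.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def encode_sequence(sequence, length):
    """Encode a sequence as [hydropathy, charge] per residue.

    The sequence is truncated to `length` residues and zero-padded if it is
    shorter, so the result always has 2 * length entries. Unknown residue
    codes are encoded as 0.0 for both properties.
    """
    features = []
    for residue in sequence[:length]:
        features.append(KYTE_DOOLITTLE.get(residue, 0.0))
        features.append(CHARGE.get(residue, 0.0))
    features.extend([0.0] * (2 * length - len(features)))
    return features


def encode_pair(seq_a, seq_b, length=8):
    """Concatenate the encodings of two proteins into one feature vector."""
    return encode_sequence(seq_a, length) + encode_sequence(seq_b, length)


# Toy sequences, far shorter than real proteins.
vector = encode_pair("MKDE", "ACLV", length=4)
print(len(vector), vector)
```

A vector of this kind, paired with a label marking the pair as interacting or non-interacting, is the sort of input an SVM would be trained on. The vector's dimension grows with sequence length and with the number of properties, which, together with the number of training pairs, contributes to the training cost noted above.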