The importance and impact of computational work in biology are ever increasing, since large-scale biological data of various types have accumulated over the past decade, including genome sequences, protein-protein interaction networks, metabolomes, genome-scale transcription data, and gene expression patterns. These data contain key information for understanding the orchestrated behavior of molecules in biological systems that is essential to sustain life. Bioinformatics is expected to play a significant role in analyzing such data, as computational techniques such as clustering, feature characterization, data mining, and modeling are indispensable in these analyses.
A particularly important and interesting problem is the prediction of protein function from tertiary structure, as the structural genomics projects (Burley, 2000; Westbrook et al., 2003; Zhang & Kim, 2003) have been solving an increasing number of protein structures of unknown function. Indeed, as of May 2009, there are over 2800 proteins of unknown function in the Protein Data Bank (PDB) (Berman et al., 2000). The function of these proteins remains unknown because no one has yet conducted experiments to characterize it and, moreover, conventional sequence comparison-based methods (Hawkins & Kihara, 2007), e.g. homology searches (Altschul et al., 1990; Altschul et al., 1997; Pearson & Lipman, 1988), functional motif searches (Hulo et al., 2006), and domain searches (Coggill et al., 2008), did not find significant similarity to protein sequences of known function. Ongoing efforts toward better function prediction include the development of sequence-based methods that are more sensitive and accurate than the conventional methods (Chitale et al., 2009a; Hawkins et al., 2008). For example, our group has recently developed two sequence-based methods, the automated Protein Function Prediction (PFP) method (Hawkins et al., 2006; Hawkins et al., 2009) and the Extended Similarity Group (ESG) method (Chitale et al., 2009b), which efficiently and accurately mine function information from PSI-BLAST searches.
Alternatively, one can use tertiary structure information to capture similarity to proteins of known function that are stored in the PDB (Thornton et al., 2000). The potential advantages of using structure information are twofold. Firstly, proteins that are evolutionarily more distantly related to a query protein can be identified, because the global structure is better conserved than the primary sequence (Chothia & Lesk, 1986; Kihara & Skolnick, 2004). Secondly, the physical features of local functional sites, where interactions with ligand molecules or other proteins take place, can be compared directly (Laskowski et al., 2005). A number of methods have been proposed that use local structure as a key feature for predicting protein function. Since small ligand molecules usually bind to a protein at pocket regions on its surface, simply identifying pockets in the protein surface can locate the active sites of enzymes in most cases (Li et al., 2007; Laskowski et al., 1996; Kawabata & Go, 2007). Programs that identify pockets include Visgrid (Li et al., 2007), POCKET (Levitt & Banaszak, 1992), LIGSITE (Hendlich et al., 1997; Huang & Schroeder, 2006), SURFNET (Laskowski, 1995), and PocketDepth (Kalidas & Chandra, 2008). Identified pockets can be further compared with known ligand binding pockets in a database to predict the type of ligand that binds to them (Kahraman et al., 2007; Tseng et al., 2009; Kihara et al., 2009; Chikhi & Kihara, 2009; Binkowski & Joachimiak, 2008; Kalidas & Chandra, 2008; Yeturu & Chandra, 2008). In these methods, pockets are characterized by their geometrical shape, the amino acid residues lining them, and physicochemical properties such as electrostatic potential and hydrophobicity.
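To make the idea of geometric pocket identification concrete, the following is a minimal sketch of a grid-based scan in the spirit of programs such as LIGSITE, which look for empty grid points enclosed by protein on opposite sides. It is not the implementation of any program cited above: the function names, parameters, and toy input are hypothetical, only the three coordinate axes are scanned (LIGSITE also scans along the cubic diagonals), and every atom is given the same radius.

```python
from itertools import product


def find_pocket_points(atoms, spacing=1.0, atom_radius=1.5,
                       min_enclosed_axes=2, padding=2):
    """Return empty grid points enclosed by protein along several axes.

    atoms: list of (x, y, z) coordinates (hypothetical input format).
    A grid point counts as "protein" if it lies within atom_radius of an atom.
    """
    # Bounding box of the atoms, padded by a few grid steps on each side.
    mins = [min(a[i] for a in atoms) - padding * spacing for i in range(3)]
    maxs = [max(a[i] for a in atoms) + padding * spacing for i in range(3)]
    dims = [int((maxs[i] - mins[i]) / spacing) + 1 for i in range(3)]

    def coords(idx):
        return tuple(mins[i] + idx[i] * spacing for i in range(3))

    # Mark grid points occupied by protein (brute force, fine for a toy case).
    r2 = atom_radius ** 2
    occupied = set()
    for idx in product(*(range(d) for d in dims)):
        p = coords(idx)
        if any(sum((p[i] - a[i]) ** 2 for i in range(3)) <= r2 for a in atoms):
            occupied.add(idx)

    def hits_protein(idx, axis, step):
        # Walk from idx along one axis direction until protein or grid edge.
        pos = list(idx)
        pos[axis] += step
        while 0 <= pos[axis] < dims[axis]:
            if tuple(pos) in occupied:
                return True
            pos[axis] += step
        return False

    pocket = []
    for idx in product(*(range(d) for d in dims)):
        if idx in occupied:
            continue
        # Count axes along which the point has protein on both sides.
        enclosed = sum(
            1 for axis in range(3)
            if hits_protein(idx, axis, 1) and hits_protein(idx, axis, -1)
        )
        if enclosed >= min_enclosed_axes:
            pocket.append(coords(idx))
    return sorted(pocket)


def toy_protein():
    """A 5x5x5 block of atoms with a narrow channel open at the top."""
    channel = {(2, 2, 2), (2, 2, 3), (2, 2, 4)}
    return [(float(x), float(y), float(z))
            for x, y, z in product(range(5), repeat=3)
            if (x, y, z) not in channel]


if __name__ == "__main__":
    # The three channel points are enclosed along x and y but open along +z.
    print(find_pocket_points(toy_protein(), atom_radius=0.5))
```

On the toy input, the scan reports only the three points of the open channel, since points outside the block are never flanked by protein on both sides along any axis. Raising `min_enclosed_axes` to 3 would instead select only fully buried cavities, which illustrates why the enclosure threshold matters when the goal is to find surface pockets.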