A Web Database IR-PDB for Sequence Repeats of Proteins in the Protein Data Bank

Selvaraj Samuel (Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, India) and Mary Rajathei (Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, India)
Copyright: © 2017 | Pages: 10
DOI: 10.4018/IJKDB.2017070101


Amino acid repeats play significant roles in the evolution of the structure and function of many large proteins. Analyzing internal repeats in proteins of known structure helps clarify the importance of those repeats. IR-PDB, a database of sequence repeats in the proteins of the PDB, has been developed to support analysis of the impact of repeats in proteins. Using the state-of-the-art repeat detection method RADAR, internal repeats were detected in 148202 of the 285714 sequences belonging to 115031 PDB structures. The identified sequence repeats were annotated with secondary structure information in order to analyze the structural consequences and conservation of the repeats. The tertiary structure of the repeats and their functional involvement can be explored through web links to PDB, PDBsum and Pfam. IR-PDB systematically annotates the proteins in the PDB that contain sequence repeats, together with their structure, and the dataset can be accessed interactively through web services.
Article Preview


A large proportion of proteins contain repeated segments of amino acids that often correspond to structural and functional units. The percentage of repeat-containing proteins grows with the complexity of the organism, which suggests that internal duplication is an important mechanism in the evolution of multicellular organisms. Repeat length varies considerably, from short repeats of a few amino acids (<50 residues in length) to longer domain repeats, and a repeat may occur as a single pair or as many copies. Based on the distance between adjacent units, repeats are classified as tandem repeats, which are contiguous, and non-tandem repeats, which are interspersed along the sequence. It has been observed that single amino acid/homopeptide repeats (Jorda & Kajava, 2010), oligopeptide repeats (2-20 amino acids in length) (Fraser & MacRae, 1973) and repeats longer than 20 amino acids are involved in various diseases such as neurodegenerative disorders, cancer, muscular dystrophy, and others (Djian, 1998; Peruz, 1999; Orr et al., 2007; Burchel et al., 2006). It has been suggested that arrays of repeats provide regular spatial and functional groups that are useful for structural packing or for interactions with target molecules (Katti et al., 2000). Further, the involvement of different repeat types of length less than 60, such as tetratricopeptide, leucine-rich, ankyrin and armadillo/HEAT repeats (Fraser & MacRae, 1973; Kobe & Kajava, 2001; Grover & Barford, 1999; Yoder et al., 1993), in various structures and functions of proteins has been highlighted (Andrade et al., 2001). Structural analysis of repeat pairs longer than 50 residues has shown that most sequence repeats adopt a similar fold despite sequence divergence and are involved in the function of the protein (Mary & Selvaraj, 2013). A study of the conservation of tertiary structure between repeats in functional units (domains) of proteins, using structure-based parameters, suggests that equivalent residues in the repeated segments share a similar tertiary environment, which allows them to adopt a similar fold (Mary, Saravanan, & Selvaraj, 2015).
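
The tandem versus non-tandem distinction described above can be illustrated with a minimal sketch. It assumes that repeat units are given as 0-based, end-exclusive (start, end) positions and that "tandem" means the gap between adjacent units is at most a chosen threshold. The function name `classify_adjacent_units` and the parameter `max_gap` are hypothetical and are not taken from IR-PDB or RADAR.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 0-based, end exclusive.
RepeatUnit = Tuple[int, int]


def classify_adjacent_units(units: List[RepeatUnit], max_gap: int = 0) -> List[str]:
    """Label each pair of adjacent repeat units as 'tandem' or 'non-tandem'.

    A pair is called tandem when the number of residues between the end of
    one unit and the start of the next is at most max_gap (an assumed cutoff).
    """
    ordered = sorted(units)
    labels = []
    for (_, end_prev), (start_next, _) in zip(ordered, ordered[1:]):
        gap = start_next - end_prev
        labels.append("tandem" if gap <= max_gap else "non-tandem")
    return labels


# Three units: the first two are contiguous, the third follows a 15-residue gap.
print(classify_adjacent_units([(0, 10), (10, 20), (35, 45)]))
```

Raising `max_gap` would tolerate short linkers between units; the appropriate cutoff is a modelling choice that the text does not specify.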

A number of servers are available to detect sequence repeats in proteins, based on different algorithms. Web servers such as XSTREAM (Newman & Cooper, 2007) and T-REKS (Jorda & Kajava, 2009) are based on short string extension algorithms, which can identify relatively short tandem repeats (less than 15-20 residues) containing insertions and deletions. The RADAR and TRUST web servers (Heger & Holm, 2000; Szklarczyk & Heringa, 2004) are efficient at detecting long repeats (repetitive units of more than 15 residues) by comparing a protein sequence to itself. On the other hand, the TPRpred tool (Karpenahalli et al., 2007) and the REP method (Andrade et al., 2000) use a priori generated alignments to construct Hidden Markov Models (HMMs) or sequence profiles (Bucher et al., 1996; Gribskov et al., 1987) to detect repeats. Each profile or HMM in these sets is compared in turn to the query sequence to find the best hit and any additional hits. Finally, HHrepID (Biegert & Soding, 2008) is a method that relies on HMM-HMM or profile-profile comparison for ‘ab initio’ detection of tandem repeats.
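
The idea of comparing a protein sequence to itself can be illustrated with a deliberately simplified sketch. It is not the RADAR or TRUST algorithm: those methods score alignments and tolerate divergence, whereas this example only reports exact matches along the diagonals of a self-comparison (dot-plot) matrix. The function name `self_matches`, the parameter `min_len`, and the example sequence are assumptions made for illustration.

```python
from typing import List, Tuple


def self_matches(seq: str, min_len: int = 4) -> List[Tuple[int, int, int]]:
    """Find exact repeated segments by scanning diagonals of the self dot-plot.

    Returns (i, j, length) tuples with i < j such that
    seq[i:i + length] == seq[j:j + length] and length >= min_len.
    """
    n = len(seq)
    hits = []
    # Each offset corresponds to one diagonal above the main diagonal.
    for offset in range(1, n):
        run = 0
        for i in range(n - offset):
            if seq[i] == seq[i + offset]:
                run += 1
            else:
                if run >= min_len:
                    hits.append((i - run, i - run + offset, run))
                run = 0
        # Close a run that reaches the end of the diagonal.
        if run >= min_len:
            end = n - offset
            hits.append((end - run, end - run + offset, run))
    return hits


# Toy sequence in which the segment ACDEF occurs at positions 0 and 8.
print(self_matches("ACDEFGHIACDEF"))
```

Because only identical residues count as matches, diverged repeats of the kind discussed in this article would be missed; a substitution matrix and gapped alignment would be needed to approximate real repeat detectors.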
