Improving PSI-BLAST’s Fold Recognition Performance through Combining Consensus Sequences and Support Vector Machine


Ren-Xiang Yan (China Agricultural University, China), Jing Liu (China Agricultural University, China) and Yi-Min Tao (China Agricultural University, China)
DOI: 10.4018/978-1-60960-064-8.ch005

Abstract

Profile-profile alignment is among the most sensitive computational approaches for identifying remote homologies and recognizing protein folds. However, it is usually much more complex and slower than sequence-sequence or profile-sequence alignment. A profile, or PSSM (position-specific scoring matrix), represents the mutational variability at each position of a protein sequence as a vector of amino acid substitution frequencies, which makes it a much richer encoding than the sequence alone. A consensus sequence, which can be regarded as a simplified profile, was used in early work to improve sequence alignment accuracy. More recently, several studies have used consensus sequence information to improve PSI-BLAST’s fold recognition performance. Because a consensus sequence can be computed in several ways, in this chapter we propose a method that combines the information from different types of consensus sequences with the help of a support vector machine. Benchmark results suggest that our method can further improve PSI-BLAST’s fold recognition performance.

Computational Models And Methods

Dataset

The protein sequences were extracted from version 1.73 of the SCOP ASTRAL Compendium database (Andreeva et al., 2004), filtered at 10% sequence identity and an e-value threshold of 0.01. We excluded sequences whose superfamily contained only a single member and sequences that were too short (fewer than 60 amino acids). Only one representative sequence was retained for each protein family. Membrane proteins, small proteins, and multi-domain proteins were also removed. In the end, 1754 protein sequences remained. The resulting dataset, named SCOP1754, covers 409 superfamilies and 294 folds.
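The filtering steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' actual pipeline: the `Domain` record, the function names, and the order in which the filters are applied are all hypothetical, and the first sequence seen in each family is arbitrarily taken as its representative. The sketch relies on the standard SCOP `sccs` string (class.fold.superfamily.family, e.g. `a.1.1.1`), in which classes `e`, `f`, and `g` denote multi-domain, membrane, and small proteins respectively.

```python
from collections import Counter
from typing import List, NamedTuple


class Domain(NamedTuple):
    """Hypothetical record for one ASTRAL domain sequence."""
    sid: str        # domain identifier
    sccs: str       # SCOP classification string: class.fold.superfamily.family
    sequence: str   # amino acid sequence


# SCOP classes e (multi-domain), f (membrane) and g (small proteins).
EXCLUDED_CLASSES = {"e", "f", "g"}
MIN_LENGTH = 60  # sequences shorter than this are dropped


def superfamily_of(sccs: str) -> str:
    """Return the class.fold.superfamily prefix of an sccs string."""
    return ".".join(sccs.split(".")[:3])


def filter_domains(domains: List[Domain]) -> List[Domain]:
    """Apply the dataset filters in one assumed order."""
    # 1. Drop short sequences and the excluded SCOP classes.
    kept = [
        d for d in domains
        if len(d.sequence) >= MIN_LENGTH
        and d.sccs.split(".")[0] not in EXCLUDED_CLASSES
    ]

    # 2. Keep one representative per family (assumption: the first one seen).
    representatives = {}
    for d in kept:
        representatives.setdefault(d.sccs, d)
    reps = list(representatives.values())

    # 3. Drop sequences whose superfamily has only a single member left.
    counts = Counter(superfamily_of(d.sccs) for d in reps)
    return [d for d in reps if counts[superfamily_of(d.sccs)] > 1]
```

Applying the filters in a different order (for example, counting superfamily members before choosing family representatives) can change which sequences survive, so the order used here should be read as one plausible choice rather than the one used to build SCOP1754.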
