Using a Bio-Inspired Algorithm to Resolve the Multiple Sequence Alignment Problem

Using a Bio-Inspired Algorithm to Resolve the Multiple Sequence Alignment Problem

El-amine Zemali (University of Science and Technologies HouariBoumedienne (USTHB), Algiers, Algeria) and Abdelmadjid Boukra (University of Science and Technology Houari Boumediene (USTHB), Algiers, Algeria)
Copyright: © 2016 |Pages: 20
DOI: 10.4018/IJAMC.2016070103
OnDemand PDF Download:
$30.00
List Price: $37.50

Abstract

One of the most challenging tasks in bioinformatics is the resolution of Multiple Sequence Alignment (MSA) problem. It consists in comparing a set of protein or DNA sequences, in aim of predicting their structure and function. This paper introduces a new bio-inspired approach to solve such problem. This approach named BA-MSA is based on Bat Algorithm. Bat Algorithm (BA) is a recent evolutionary algorithm inspired from Bats behavior seeking their prey. The proposed approach includes new mechanism to generate initial population. It consists in generating a guide tree for each solution with progressive approach by varying some parameters. The generated guide tree will be enhanced by Hill-Climbing algorithm. In addition, to deal with the premature convergence of BA, a new restart technique is proposed to introduce more diversification when detecting premature convergence. Balibase 2.0 datasets are used for experiments. The comparison with well-known methods as MSA-GA MSA-GA (w\prealign), ClustalW, and SAGA and recent method (BBOMP) shows the effectiveness of the proposed approach.
Article Preview

1. Introduction

Bioinformatics is an interface between computer science and biology. This field has several open ended problems. MSA is the most known one.

An alignment is a biological sequences comparison to discover similar subsequences. This is motivated by the fact that proteins or genes with similar subsequences are likely to perform the same function. A biological sequence can be a protein sequence composed by different types of amino acids or DNA sequence composed by nucleotides. There are 20 types of amino acid and 04 types of nucleotides. If a sequence alignment involves two sequences, it is called a pair-wise alignment (Needleman & Wunsch, 1970), (Smith & Waterman, 1981). The main goal is to find the similar or closely related parts. If the alignment involves more than two sequences, it is called a multiple sequence alignment. Multiple sequence alignment is a simultaneous arrangement of more than two sequences; each one represented in a different line and can have different length. The major challenge is to emphasize the identical residues in the same column by introducing the symbol “-”, called gap. The MSA is practical for study molecular function and evolution. It allows identifying conserved regions, definition of evolutionary relationships, predicting the secondary or tertiary structures of proteins, gene regulation and polymerase chain reaction (Anbarasu, Narayanasamy, & Sundararajan, 2000). The alignment between sequences can be global or local. In global alignment, the sequences are aligned over their whole length. By contrast, local alignment identifies regions of similarity within a subsequence (Thompson, Plewniak, & Poch, 1999). Local alignments are often preferable, but can be more difficult because of the additional challenge of identifying the similar regions. An effective global alignment technique is the Needleman–Wunsch algorithm (Needleman & Wunsch, 1970). It is based on dynamic programming and produces the optimal pair-wise alignment. In other hand, Smith–Waterman algorithm (Smith & Waterman, 1981) is widely used for local alignment; it was inspired from Needleman–Wunsch algorithm.

The optimal alignment is one that describes the most probable evolutionary scenario between all sequences. It generally maximizes the mathematical score. Optimal alignment for two sequences can be found using dynamic programming. But for greater than 3 sequences, the complexity grows considerably. For N sequences with length equal to L the computational complexity is O (Ln). So, MSA is an NP-hard problem (Wang & Jiang, 1994). The high complexity of the problem leads to use heuristics to find a near optimal alignment with a reasonable cost. Several methods have been developed. They are generally classified into two main classes: progressive approach and iterative approach.

The progressive approach is the most commonly used for multiple sequence alignment. It produces a reasonable alignment for large sequences number with lower computational cost. The main idea of the progressive alignment is to gradually include the sequences in their evolutionary relationships order depicted by a tree, named guide tree. This order has an immediate impact on the alignment result. The guide tree construction involves three steps. In the first step all pair-wise alignments are performed using dynamic programming. In the second step, the pair-wise distances table is computed, finally the resulting table is used to construct a guide tree using grouping techniques such as Neighbor-Joining (NJ) (Saitou & Nei, 1987). The efficiency of progressive approaches depends widely on the order depicted by the guide tree. Consequently, they usually converge to local optima.

The other alternative is the use of iterative approach. This approach starts with an initial solution in which all sequences are aligned simultaneously. The current solution is improved using iterative steps in aim of optimizing the fitness score. The iterative methods have not been widely adopted, because they are not efficient while none of fitness score can effectively guide the algorithm to the global optimal and they need a high computational time (Ji, Ye, Yang, & Guo, 2009).

Complete Article List

Search this Journal:
Reset
Open Access Articles: Forthcoming
Volume 9: 4 Issues (2018): 1 Released, 3 Forthcoming
Volume 8: 4 Issues (2017)
Volume 7: 4 Issues (2016)
Volume 6: 4 Issues (2015)
Volume 5: 4 Issues (2014)
Volume 4: 4 Issues (2013)
Volume 3: 4 Issues (2012)
Volume 2: 4 Issues (2011)
Volume 1: 4 Issues (2010)
View Complete Journal Contents Listing