Penguin Search Optimisation Algorithm for Finding Optimal Spaced Seeds

Penguin Search Optimisation Algorithm for Finding Optimal Spaced Seeds

Youcef Gheraibia, Abdelouahab Moussaoui, Youcef Djenouri, Sohag Kabir, Peng-Yeng Yin, Smaine Mazouzi
DOI: 10.4018/IJSSCI.2015040105
OnDemand:
(Individual Articles)
Available
$37.50
No Current Special Offers
TOTAL SAVINGS: $37.50

Abstract

This paper develops PeSeeD, a new metaheuristic algorithm for finding optimal spaced seed. Sequences matching is a hot topic in bio-informatics, which is used in many applications such as understanding the functional, structural, or evolutionary relationships between the sequences. The most relevant sequences matching methods are based on seeds designed to match two biological sequences. The first approach which introduced seeds was facilitated via Blastn tool, the approach builds seeds of 11 length size. However, it is clear that not all local alignments have to include an identical fragment of length 11. The spaced seeds approach is one of the methods which does not require a consecutive matching position. Dynamic programming is used to solve this kind of problem and it takes quadratic time. Several approaches have then been proposed to improve the sensitivity of searching in reasonable runtime. To reduce the complexity of such approaches, other heuristics based approaches can also be reviewed. The aim is to find spaced seeds subset which improves sensitivity without increasing the computation time. In this paper, the optimal subset spaced seeds are explored using the bio-inspired approach, penguins search optimisation algorithm (‘'PeSOA'' for short). The authors further propose an efficient heuristic for computing the overlap complexity between seeds. To evaluate the efficiency of the proposed approach, they compared the obtained results with the results of several seeds based software tools. The obtained results are very promising in terms of sensitivity and computation time for the overlap complexity.
Article Preview
Top

1. Introduction

Biological instances contain a large biological information with different types of data (sequence, structure...). These latter are either proteins (Uniprot) or nucleic sequence (EMBL, GenBank...). Thus, realising similarity search among these instances in a reasonable time is a difficult task. Undoubtedly, developing an efficient algorithm based on spaced seeds is a big challenge for bio-informatics community. The use of spaced seeds performs specialised optimisation for next generation sequencing. Several methods for solving this problem have been recently proposed and can be classified in two categories. The first one employs dynamic programming and can find an exact solution with quadratic time complexity (Smith et al., 1981). As biological databases grow larger, this exact approach is usually required a high time consuming. The Second category is the heuristic algorithms (Lipman et al., 1985) which can achieve good solutions in a reasonable time. The heuristic methods are based on approximate string matching.

The alignment of two biological sequences based on seeds is an efficient technique used by several algorithms in order to produce other sequencing data generations. Among these algorithms, the 11 consecutive matches of BLAST (Altschul et al., 1990) is called a contiguous seed, denoted as 11111111111 for eleven consecutive matches. It is required to find an identical stretch of length 11 which is not always feasible. In order to increase the probability to find an alignment, PatternHunter II (Li et al., 2004) uses one or several non-contiguous seeds called spaced seeds. Concretely, each spaced seed S is a vector of n elements where n is the length of each seed and their position is defined as follows: S[i] = 1 if the position need required matching and S[i] = 0 if we do not care about position matching, the number of ones in the seed called the weight of the seed. This sensitivity is used to evaluate the quality of a spaced seed for matches alignment. The objective of spaced seed is to increase sensitivity without reducing the computation time performance. The sensitivity is approximated by matching the spaced seeds and a Bernoulli representation of the alignment.

Homology search in biological sequences is a very important task for discovering and understanding similarities among genes and proteins, in order to find similar segments, or local alignments, between two DNA or protein sequences (Altschul et al., 1990). The sizes of DNA and protein databases become very large, such as the EMBL Nucleotide Sequence Database (EMBL-Bank) has increased in size from around 600 entries in 1982 to over 6.2×108 by MARCH 2015, so homology search is very time consuming and far to be done in reasonable time (Altschul et al., 1990).

Any optimisation problem such as finding optimal spaced seeds have two conflicting factors, computation time (searching speed) and solution quality (sensitivity). The aim of all previous methods is to design in reasonable time a good set of spaced seeds having high sensitivity. So there is a tradeoff between the computation process and the sensitivity (Choi et al., 2004). Indeed, we can increase the sensitivity by decreasing the required weight of the hit, nevertheless, the decreasing of the weight of hits will increase the runtime and also increase the number of fallacious hits.

The Penguins search optimisation algorithm (Gheraibia et al., 2013) is a meta-heuristic based on hunting behaviour of penguins used for solving complex problems. The hunting strategy of penguins is more than fascinating since they can collaborate their efforts and synchronise their dives to optimise the global energy in the process of collective hunting.

Complete Article List

Search this Journal:
Reset
Volume 16: 1 Issue (2024)
Volume 15: 1 Issue (2023)
Volume 14: 4 Issues (2022): 1 Released, 3 Forthcoming
Volume 13: 4 Issues (2021)
Volume 12: 4 Issues (2020)
Volume 11: 4 Issues (2019)
Volume 10: 4 Issues (2018)
Volume 9: 4 Issues (2017)
Volume 8: 4 Issues (2016)
Volume 7: 4 Issues (2015)
Volume 6: 4 Issues (2014)
Volume 5: 4 Issues (2013)
Volume 4: 4 Issues (2012)
Volume 3: 4 Issues (2011)
Volume 2: 4 Issues (2010)
Volume 1: 4 Issues (2009)
View Complete Journal Contents Listing