Article Preview
Top1. Introduction
Biological instances contain a large biological information with different types of data (sequence, structure...). These latter are either proteins (Uniprot) or nucleic sequence (EMBL, GenBank...). Thus, realising similarity search among these instances in a reasonable time is a difficult task. Undoubtedly, developing an efficient algorithm based on spaced seeds is a big challenge for bio-informatics community. The use of spaced seeds performs specialised optimisation for next generation sequencing. Several methods for solving this problem have been recently proposed and can be classified in two categories. The first one employs dynamic programming and can find an exact solution with quadratic time complexity (Smith et al., 1981). As biological databases grow larger, this exact approach is usually required a high time consuming. The Second category is the heuristic algorithms (Lipman et al., 1985) which can achieve good solutions in a reasonable time. The heuristic methods are based on approximate string matching.
The alignment of two biological sequences based on seeds is an efficient technique used by several algorithms in order to produce other sequencing data generations. Among these algorithms, the 11 consecutive matches of BLAST (Altschul et al., 1990) is called a contiguous seed, denoted as 11111111111 for eleven consecutive matches. It is required to find an identical stretch of length 11 which is not always feasible. In order to increase the probability to find an alignment, PatternHunter II (Li et al., 2004) uses one or several non-contiguous seeds called spaced seeds. Concretely, each spaced seed S is a vector of n elements where n is the length of each seed and their position is defined as follows: S[i] = 1 if the position need required matching and S[i] = 0 if we do not care about position matching, the number of ones in the seed called the weight of the seed. This sensitivity is used to evaluate the quality of a spaced seed for matches alignment. The objective of spaced seed is to increase sensitivity without reducing the computation time performance. The sensitivity is approximated by matching the spaced seeds and a Bernoulli representation of the alignment.
Homology search in biological sequences is a very important task for discovering and understanding similarities among genes and proteins, in order to find similar segments, or local alignments, between two DNA or protein sequences (Altschul et al., 1990). The sizes of DNA and protein databases become very large, such as the EMBL Nucleotide Sequence Database (EMBL-Bank) has increased in size from around 600 entries in 1982 to over 6.2×108 by MARCH 2015, so homology search is very time consuming and far to be done in reasonable time (Altschul et al., 1990).
Any optimisation problem such as finding optimal spaced seeds have two conflicting factors, computation time (searching speed) and solution quality (sensitivity). The aim of all previous methods is to design in reasonable time a good set of spaced seeds having high sensitivity. So there is a tradeoff between the computation process and the sensitivity (Choi et al., 2004). Indeed, we can increase the sensitivity by decreasing the required weight of the hit, nevertheless, the decreasing of the weight of hits will increase the runtime and also increase the number of fallacious hits.
The Penguins search optimisation algorithm (Gheraibia et al., 2013) is a meta-heuristic based on hunting behaviour of penguins used for solving complex problems. The hunting strategy of penguins is more than fascinating since they can collaborate their efforts and synchronise their dives to optimise the global energy in the process of collective hunting.