Knowledge discovery from genomic data has become an important research area for biologists. A large amount of data is now available on the web, but the corresponding knowledge is not necessarily available as well. For example, the first draft of the human genome, which contains 3×10⁹ letters, was completed in June 2000, but so far only a small part of the knowledge hidden in it has been discovered. The aim of bioinformatics is to bring together biology, computer science, mathematics, statistics and information theory to analyze biological data for interpretation and prediction. Hence many problems encountered while studying genomic data may be modeled as data mining tasks, such as feature selection, classification, clustering, and association rule discovery. An important characteristic of genomic applications is the large amount of data to analyze; most of the time, it is not possible to enumerate all the possibilities. Therefore, we propose to model these knowledge discovery tasks as combinatorial optimization tasks, so that efficient optimization algorithms can be applied to extract knowledge from large datasets.
To design an efficient optimization algorithm, several aspects have to be considered. The main one is the choice of the type of resolution method according to the characteristics of the problem. Is it an easy problem, for which a polynomial algorithm may be found? If so, such an algorithm should be designed. Unfortunately, most of the time the answer is no, and only heuristics, which may find good but not necessarily optimal solutions, can be used. In our approach we focus on evolutionary computation, which has already proved able to solve highly complex combinatorial problems. In this chapter, we show the efficiency of such an approach while describing the main steps required to solve data mining problems from genomics with evolutionary algorithms. We illustrate these steps with a real problem.
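To make the idea of modeling a data mining task as a combinatorial optimization problem more concrete, the following minimal Python sketch treats feature selection as a search over binary masks and explores it with a simple evolutionary algorithm. It is only an illustration under stated assumptions, not the method developed in this chapter: the toy dataset, the majority-vote classifier, the size penalty, and all function and parameter names (`make_toy_data`, `fitness`, `evolve`, `penalty`, `mutation_rate`) are hypothetical choices made for the example.

```python
import random


def make_toy_data(n_samples=60, n_features=12, informative=(0, 3, 7), seed=0):
    """Build a synthetic binary dataset (assumption: not real genomic data).

    The label is 1 when at least two of the informative features are 1.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.randint(0, 1) for _ in range(n_features)]
        y.append(1 if sum(row[j] for j in informative) >= 2 else 0)
        X.append(row)
    return X, y


def fitness(mask, X, y, penalty=0.01):
    """Score a feature subset: accuracy of a majority vote minus a size penalty."""
    selected = [j for j, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0
    correct = 0
    for row, label in zip(X, y):
        votes = sum(row[j] for j in selected)
        predicted = 1 if 2 * votes > len(selected) else 0
        correct += predicted == label
    return correct / len(y) - penalty * len(selected)


def evolve(X, y, pop_size=30, generations=40, mutation_rate=0.05, seed=1):
    """Search the space of feature masks with a basic evolutionary algorithm."""
    rng = random.Random(seed)
    n = len(X[0])

    def score(mask):
        return fitness(mask, X, y)

    # Each individual is a binary mask: bit j set means feature j is selected.
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        elite = max(population, key=score)
        history.append(score(elite))
        next_population = [elite[:]]  # elitism: the best mask always survives
        while len(next_population) < pop_size:
            # Tournament selection of two parents (tournament size 3).
            parent_a = max(rng.sample(population, 3), key=score)
            parent_b = max(rng.sample(population, 3), key=score)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randint(1, n - 1)
            child = parent_a[:cut] + parent_b[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=score)
    return best, score(best), history


if __name__ == "__main__":
    X, y = make_toy_data()
    best_mask, best_score, _ = evolve(X, y)
    print("selected features:", [j for j, bit in enumerate(best_mask) if bit])
    print("fitness:", round(best_score, 3))
```

With 12 features there are only 2¹² candidate masks, so exhaustive enumeration would still be feasible here; the sketch is meant to show the encoding, the fitness function and the variation operators, which are the elements that remain when the search space becomes too large to enumerate.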
Evolutionary data mining for genomics brings together three important fields: evolutionary computation, knowledge discovery and genomics.
It is now well established that evolutionary algorithms are well suited to some data mining tasks; the reader may refer, for example, to (Freitas, 2008).
Here we want to show the value of handling genomic data with evolutionary approaches. A first indication of this value is the book by Gary Fogel and David Corne, “Evolutionary Computation in Bioinformatics”, which gathers several applications of evolutionary computation to problems in the biological sciences, and in particular in bioinformatics (Corne, Pan, Fogel, 2008). In this book, several data mining tasks, such as feature selection and clustering, are addressed and solved with evolutionary approaches.
Another indication of the interest in such approaches is the number of sessions on “Evolutionary computation in bioinformatics” at conferences on evolutionary computation. Examples include EvoBio, the European Workshop on Evolutionary Computation and Machine Learning in Bioinformatics, and the special sessions on “Evolutionary computation in bioinformatics and computational biology” organized at recent Congresses on Evolutionary Computation (CEC’06, CEC’07).
The aim of genomic studies is to understand the function of genes, to determine which genes are involved in a given process, and to understand how genes are related. To this end, experiments are conducted, for example, to localize coding regions in DNA sequences and/or to evaluate the expression level of genes under certain conditions. As a result, the data available to the bioinformatics researcher may include not only DNA sequence information but also other types of data; in studies of multi-factorial diseases, for example, these may be the Body Mass Index, the sex, and the age of the individuals. The example used to illustrate this chapter falls into this category.